Go to
previous page /
next page of CMS site log
11. 11. 2010 CMS production jobs starving in the queue
Qtop output for currently running jobs:
===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - )
|000000000111111111122222|222223333333333444444444|455555555556666666666777|777777788888888889999999|={_Worker_}|
|123456789012345678901234|567890123456789012345678|901234567890123456789012|345678901234567890123456|={__Node__}|
|------------------------|----------------o-------|------------------------|------------------------|=Node state|
|B9916BC6676671AB799696A6|4CC914996DC647AA_A911AA9|176677D17B676917DBB6C797|B167BAB7B9B6671946171966|=CPU0
|7111967977967661BC6C77A1|1176C697A4AC671A_C76A169|66611A7176717C6A1DAA11C7|97B696761777CA77B6B67711|=CPU1
|691477191777679967A676B6|71A1116B11A76161_6911461|C7A71166911BC791C7D646B7|16B19769A76AC769AC677766|=CPU2
|CCA64D19AB1B7DDCC6C6666A|BB6417A9671D169A_69A1AB6|BD9A4696716966667991C666|1C7B7BB77ADMAD1AB1767CC6|=CPU3
|769BD674616AD7CCB9A96B61|146176D7B117C716_7DBA679|A66_976977C1A67A719CCC67|9777A61DCA66B77B761B76C1|=CPU4
|9C4D_16A617661C16B116717|91761176D6766979_96AC797|A9_CD6_77177119676ABC9J7|67A6B6CB1C6AC6A1D9717476|=CPU5
|AC16A1B11C6AB196_BB61B17|1DD1996_11199711_C711B16|761A16677A41677BBC66A976|7696179776766C9C679B6A16|=CPU6
|7C677167199647191A11C16B|_61C1A1A11BA764C_7_6A_7B|B1A6979114CD7CA1CB7C6167|7B666B1DBC61916117779CAC|=CPU7
|AA4B171BA16A1CC111166641|1AB917671B1611C6_1D91B71|14C67A97DC17667111171A67|1B667676_1B179C76B767776|=CPU8
|47946D6D99DD761111BC7711|61111D67A1BD19A1_B1111DA|C1A6CB_AA69A9A1B411616B1|6A7C61BC9A6C117B7696997C|=CPU9
|DC17A4144A9D714411_4D66B|6D674166711B117C___D6616|9_B1A7_CD6B_6677C97717_9|7A167D196C6BC77117976BAD|=CPU10
|D_7117119DD14619A1D16DB1|C61161D_6161_41B_D_676D1|_6DC44_177A1__7B611BB66_|1471A6C46_16AA7771__61A6|=CPU11
|________________________|________________________|________________________|________________________|=CPU12
|________________________|________________________|________________________|________________________|=CPU13
|________________________|________________________|________________________|________________________|=CPU14
|________________________|________________________|________________________|________________________|=CPU15
===> User accounts and pool mappings <=== ('all' includes those in C and W states, as reported by qstat)
id| R + Q / all | unix account | Grid certificate DN (this info is only available under elevated privileges)
0 | 0 +1197 /1197 | cms107 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=malbouis/cn=411828/cn=helena brandao malbouisson:cms
1 |231 +489 / 721 | atlasprd |
2 | 0 +443 / 443 | cms228 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=trocino/cn=668497/cn=daniele trocino:cms
3 | 0 +400 / 400 | prdcms42 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=vdutta/cn=696509/cn=valentina dutta:prdcms
4 | 38 +320 / 361 | atlasplt |
5 | 0 +264 / 264 | prdcms11 | /c=be/o=begrid/ou=fynu/ou=ucl/cn=nicolas schul:prdcms
6 |228 + 8 / 236 | cms034 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot47/glidein-1.t2.ucsd.edu
7 |199 + 8 / 207 | cms485 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot49/glidein-1.t2.ucsd.edu
8 | 0 +200 / 200 | prdcms24 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=asciaba/cn=430796/cn=andrea sciaba:prdcms
9 | 98 + 4 / 102 | cms352 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot10/glidein-1.t2.ucsd.edu
A | 94 + 4 / 98 | cms430 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot07/glidein-1.t2.ucsd.edu
B | 86 + 4 / 90 | cms139 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot05/glidein-1.t2.ucsd.edu
C | 83 + 4 / 87 | cms023 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot09/glidein-1.t2.ucsd.edu
D | 51 + 14 / 66 | lhcbplt01 | /dc=es/dc=irisgrid/o=ecm-ub/cn=ricardo-graciani-diaz:lhcbplt
E | 0 + 6 / 6 | honeprd |
F | 0 + 5 / 5 | prdcms01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=sciaba/cn=430796/cn=andrea sciaba:prdcms
G | 0 + 5 / 5 | opssgm |
H | 0 + 5 / 5 | lhcbplt09 | /o=grid-fr/c=fr/o=cnrs/ou=cppm/cn=andrei tsaregorodtsev:lhcbplt
I | 0 + 3 / 3 | atlassgm |
J | 1 + 2 / 3 | atlas107 | /o=germangrid/ou=lmu/cn=johannes elmsheuser:atlas
K | 0 + 2 / 2 | lhcbprd01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=santinel/cn=564059/cn=roberto santinelli:lhcb
L | 0 + 2 / 2 | cmssgm |
M | 1 + 0 / 1 | cms429 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=tjkim/cn=569507/cn=tae jeong kim:cms
It turns out that the most recent current jobs have been running since yesterday around 21:40:14
-bash-3.2$ grep cms showq.log | grep Running | tail -n 10
1998991 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998992 cms139 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998993 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998994 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998995 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998997 cms139 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1998999 cms430 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1999044 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1999045 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14
1999047 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14
How long have the prdcms jobs been waiting in the queue:
bash-3.2$ grep cms showq.log | grep prdcms | head -n 10
1992979 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47
1992980 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47
1992981 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47
1992982 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47
1992983 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:48
1992984 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:48
1992985 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:53
1992986 prdcms42 Idle 1 3:00:00:00 Wed Nov 10 10:26:56
1992987 prdcms42 Idle 1 3:00:00:00 Wed Nov 10 10:26:56
1992988 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:27:49
-bash-3.2$ grep cms showq.log | grep prdcms | tail -n 10
2001747 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46
2001748 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46
2001749 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46
2001757 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 09:13:02
2002095 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34
2002096 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34
2002097 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34
2002098 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34
2002541 prdcms42 Idle 1 3:00:00:00 Thu Nov 11 12:41:54
2002542 prdcms42 Idle 1 3:00:00:00 Thu Nov 11 12:41:59
Running CMS jobs run with efficiencies from 50 to 95 percent.
-bash-3.2$ grep cms showqr.log | sort -k 4 -r | head
1998539 R lrm 95.21 1.0 - cms485 cms wn189 1 2:07:06:16 Wed Nov 10 20:54:13
1998536 R lrm 95.16 1.0 - cms430 cms wn169 1 2:07:06:16 Wed Nov 10 20:54:13
1998631 R lrm 95.09 1.0 - cms485 cms wn172 1 2:07:21:58 Wed Nov 10 21:09:55
1998161 R lrm 94.87 1.0 - cms139 cms wn106 1 2:05:24:51 Wed Nov 10 19:12:48
1998596 R lrm 94.79 1.0 - cms034 cms wn135 1 2:07:14:10 Wed Nov 10 21:02:07
1998533 R lrm 94.75 1.0 - cms352 cms wn110 1 2:07:03:36 Wed Nov 10 20:51:33
1998582 R lrm 94.63 1.0 - cms034 cms wn178 1 2:07:14:10 Wed Nov 10 21:02:07
1998595 R lrm 94.59 1.0 - cms034 cms wn137 1 2:07:14:10 Wed Nov 10 21:02:07
1998532 R lrm 94.56 1.0 - cms023 cms wn116 1 2:07:03:36 Wed Nov 10 20:51:33
1998630 R lrm 94.45 1.0 - cms485 cms wn106 1 2:07:19:25 Wed Nov 10 21:07:22
-bash-3.2$ grep cms showqr.log | sort -k 4 -r | tail
1996470 R lrm 56.38 1.0 - cms352 cms wn121 1 2:02:14:33 Wed Nov 10 16:02:30
1996390 R lrm 55.59 1.0 - cms430 cms wn135 1 2:02:06:16 Wed Nov 10 15:54:13
1996750 R lrm 54.65 1.0 - cms034 cms wn112 1 2:02:44:12 Wed Nov 10 16:32:09
1996466 R lrm 54.41 1.0 - cms430 cms wn109 1 2:02:11:35 Wed Nov 10 15:59:32
1995825 R lrm 54.34 1.0 - cms485 cms wn172 1 2:01:27:11 Wed Nov 10 15:15:08
1995824 R lrm 54.15 1.0 - cms430 cms wn181 1 2:01:27:11 Wed Nov 10 15:15:08
1995826 R lrm 53.61 1.0 - cms485 cms wn165 1 2:01:27:11 Wed Nov 10 15:15:08
1995823 R lrm 53.21 1.0 - cms430 cms wn182 1 2:01:27:11 Wed Nov 10 15:15:08
1995821 R lrm 52.67 1.0 - cms485 cms wn143 1 2:01:27:11 Wed Nov 10 15:15:08
1996375 R lrm 47.99 1.0 - cms034 cms wn171 1 2:02:06:16 Wed Nov 10 15:54:13
Fairshare information
diagnose -f
FairShare Information
Depth: 7 intervals Interval Length: 1:00:00:00 Decay Rate: 0.70
FS Policy: DEDICATEDPS
System FS Settings: Target Usage: 0.00 Flags: 0
FSInterval % Target 0 1 2 3 4 5 6
FSWeight ------- ------- 1.0000 0.7000 0.4900 0.3430 0.2401 0.1681 0.1176
TotalUsage 100.00 ------- 14715.8 24937.3 14253.4 70.4 12515.5 26208.6 26610.8
...
...
GROUP
-------------
atlas 40.06 ------- 25.60 40.09 33.79 ------- 64.83 72.68 52.49
cms 37.82 ------- 71.11 30.68 14.52 51.02 8.18 15.30 33.19
pricms* 0.00 10.00 ------- ------- ------- ------- 0.00 ------- -------
prdcms 18.85 20.00 ------- 27.01 45.12 47.82 26.80 6.58 12.81
dech 0.00 ------- ------- ------- 0.00 ------- ------- ------- -------
lhcb 0.00 ------- 0.01 0.00 0.01 ------- ------- 0.00 0.00
hone 0.08 ------- 0.06 0.11 0.07 ------- 0.04 0.10 0.11
ops 0.01 ------- 0.00 0.01 0.01 1.16 0.03 0.02 0.03
lhcbplt* 3.16 18.00 3.22 2.09 6.49 ------- 0.11 5.33 1.37
ACCT
-------------
QOS
-------------
CLASS
-------------
atlas 40.05 40.00 25.59 40.08 33.78 ------- 64.81 72.65 52.48
cms* 56.67 40.00 71.11 57.68 59.63 98.84 34.95 21.85 45.97
lhcb* 3.17 20.00 3.23 2.09 6.50 ------- 0.11 5.33 1.37
cscs 0.00 ------- ------- ------- 0.00 ------- ------- ------- -------
lcgadmin 0.03 ------- 0.01 0.03 0.02 ------- 0.06 0.05 0.05
ops 0.01 ------- 0.00 0.01 0.01 1.16 0.03 0.02 0.03
other 0.08 ------- 0.06 0.11 0.07 ------- 0.04 0.10 0.11
CMS dashboard reports the following production job usage over the last days
Checking that the fairshare calculation is correct
(0*pow(0.7,0)*14715.8 + 27.01*pow(0.7,1)*24937.3 + 45.12*pow(0.7,2)*14253.4 + 47.82*pow(0.7,3)*70.4 + 26.80*pow(0.7,4)*12515.5 +
6.58*pow(0.7,5)*26208.6 +12.81*pow(0.7,6)*26610.8)/ (pow(0.7,0)*14715.8 + pow(0.7,1)*24937.3 + pow(0.7,2)*14253.4 + pow(0.7,3)*70.4 +
pow(0.7,4)*12515.5 + pow(0.7,5)*26208.6 + pow(0.7,6)*26610.8) = 18.853110024838912
Trying to find out more about the first queued job
Nov 11 14:11 [root@lrms02:~]# checkjob 1992979
job 1992979
AName: STDIN
State: Idle
Creds: user:prdcms11 group:prdcms class:cms
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Wed Nov 10 10:26:47
(Time Queued Total: 1:03:51:56 Eligible: 1:02:39:19)
Total Requested Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Dedicated Resources Per Task: PROCS: 1 MEM: 2000M
Flags: RESTARTABLE,FSVIOLATION
Attr: FSVIOLATION,checkpoint
StartPriority: -1301
NOTE: job req cannot run in partition lrms02 (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 397 feasible procs: 0
Node Rejection Summary: [Memory: 84][State: 1][Reserved: 11]
Nov 11 14:20 [root@lrms02:~]# showstart 1992979
job 1992979 requires 1 proc for 3:00:00:00
Estimated Rsv based start in 21:41:07 on Fri Nov 12 12:02:31
Estimated Rsv based completion in 3:21:41:07 on Mon Nov 15 12:02:31
Best Partition: lrms02
--
DerekFeichtinger - 2010-11-11
Go to
previous page /
next page of CMS site log
- dashboard-20101111.png: