Arrow left Go to previous page / next page of CMS site log MOVED TO...

11. 11. 2010 CMS production jobs starving in the queue

Qtop output for currently running jobs:

===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - )
|000000000111111111122222|222223333333333444444444|455555555556666666666777|777777788888888889999999|={_Worker_}|
|123456789012345678901234|567890123456789012345678|901234567890123456789012|345678901234567890123456|={__Node__}|
|------------------------|----------------o-------|------------------------|------------------------|=Node state|
|B9916BC6676671AB799696A6|4CC914996DC647AA_A911AA9|176677D17B676917DBB6C797|B167BAB7B9B6671946171966|=CPU0
|7111967977967661BC6C77A1|1176C697A4AC671A_C76A169|66611A7176717C6A1DAA11C7|97B696761777CA77B6B67711|=CPU1
|691477191777679967A676B6|71A1116B11A76161_6911461|C7A71166911BC791C7D646B7|16B19769A76AC769AC677766|=CPU2
|CCA64D19AB1B7DDCC6C6666A|BB6417A9671D169A_69A1AB6|BD9A4696716966667991C666|1C7B7BB77ADMAD1AB1767CC6|=CPU3
|769BD674616AD7CCB9A96B61|146176D7B117C716_7DBA679|A66_976977C1A67A719CCC67|9777A61DCA66B77B761B76C1|=CPU4
|9C4D_16A617661C16B116717|91761176D6766979_96AC797|A9_CD6_77177119676ABC9J7|67A6B6CB1C6AC6A1D9717476|=CPU5
|AC16A1B11C6AB196_BB61B17|1DD1996_11199711_C711B16|761A16677A41677BBC66A976|7696179776766C9C679B6A16|=CPU6
|7C677167199647191A11C16B|_61C1A1A11BA764C_7_6A_7B|B1A6979114CD7CA1CB7C6167|7B666B1DBC61916117779CAC|=CPU7
|AA4B171BA16A1CC111166641|1AB917671B1611C6_1D91B71|14C67A97DC17667111171A67|1B667676_1B179C76B767776|=CPU8
|47946D6D99DD761111BC7711|61111D67A1BD19A1_B1111DA|C1A6CB_AA69A9A1B411616B1|6A7C61BC9A6C117B7696997C|=CPU9
|DC17A4144A9D714411_4D66B|6D674166711B117C___D6616|9_B1A7_CD6B_6677C97717_9|7A167D196C6BC77117976BAD|=CPU10
|D_7117119DD14619A1D16DB1|C61161D_6161_41B_D_676D1|_6DC44_177A1__7B611BB66_|1471A6C46_16AA7771__61A6|=CPU11
|________________________|________________________|________________________|________________________|=CPU12
|________________________|________________________|________________________|________________________|=CPU13
|________________________|________________________|________________________|________________________|=CPU14
|________________________|________________________|________________________|________________________|=CPU15

===> User accounts and pool mappings <=== ('all' includes those in C and W states, as reported by qstat)
id|  R + Q  / all |  unix account  | Grid certificate DN (this info is only available under elevated privileges)
0 |  0 +1197 /1197 |         cms107 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=malbouis/cn=411828/cn=helena brandao malbouisson:cms
1 |231 +489 / 721 |       atlasprd | 
2 |  0 +443 / 443 |         cms228 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=trocino/cn=668497/cn=daniele trocino:cms
3 |  0 +400 / 400 |       prdcms42 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=vdutta/cn=696509/cn=valentina dutta:prdcms
4 | 38 +320 / 361 |       atlasplt | 
5 |  0 +264 / 264 |       prdcms11 |  /c=be/o=begrid/ou=fynu/ou=ucl/cn=nicolas schul:prdcms
6 |228 +  8 / 236 |         cms034 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot47/glidein-1.t2.ucsd.edu
7 |199 +  8 / 207 |         cms485 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot49/glidein-1.t2.ucsd.edu
8 |  0 +200 / 200 |       prdcms24 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=asciaba/cn=430796/cn=andrea sciaba:prdcms
9 | 98 +  4 / 102 |         cms352 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot10/glidein-1.t2.ucsd.edu
A | 94 +  4 /  98 |         cms430 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot07/glidein-1.t2.ucsd.edu
B | 86 +  4 /  90 |         cms139 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot05/glidein-1.t2.ucsd.edu
C | 83 +  4 /  87 |         cms023 |  /dc=org/dc=doegrids/ou=services/cn=uscmspilot09/glidein-1.t2.ucsd.edu
D | 51 + 14 /  66 |      lhcbplt01 |  /dc=es/dc=irisgrid/o=ecm-ub/cn=ricardo-graciani-diaz:lhcbplt
E |  0 +  6 /   6 |        honeprd | 
F |  0 +  5 /   5 |       prdcms01 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=sciaba/cn=430796/cn=andrea sciaba:prdcms
G |  0 +  5 /   5 |         opssgm | 
H |  0 +  5 /   5 |      lhcbplt09 |  /o=grid-fr/c=fr/o=cnrs/ou=cppm/cn=andrei tsaregorodtsev:lhcbplt
I |  0 +  3 /   3 |       atlassgm | 
J |  1 +  2 /   3 |       atlas107 |  /o=germangrid/ou=lmu/cn=johannes elmsheuser:atlas
K |  0 +  2 /   2 |      lhcbprd01 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=santinel/cn=564059/cn=roberto santinelli:lhcb
L |  0 +  2 /   2 |         cmssgm | 
M |  1 +  0 /   1 |         cms429 |  /dc=ch/dc=cern/ou=organic units/ou=users/cn=tjkim/cn=569507/cn=tae jeong kim:cms

It turns out that the most recent current jobs have been running since yesterday around 21:40:14

-bash-3.2$ grep cms showq.log | grep Running | tail -n 10
1998991              cms034    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998992              cms139    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998993              cms485    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998994              cms485    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998995              cms034    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998997              cms139    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1998999              cms430    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1999044              cms034    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1999045              cms485    Running     1  2:07:58:48  Wed Nov 10 21:40:14
1999047              cms485    Running     1  2:07:58:48  Wed Nov 10 21:40:14

How long have the prdcms jobs been waiting in the queue:

bash-3.2$ grep cms showq.log | grep prdcms | head -n 10
1992979            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:47
1992980            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:47
1992981            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:47
1992982            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:47
1992983            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:48
1992984            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:48
1992985            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:26:53
1992986            prdcms42       Idle     1  3:00:00:00  Wed Nov 10 10:26:56
1992987            prdcms42       Idle     1  3:00:00:00  Wed Nov 10 10:26:56
1992988            prdcms11       Idle     1  3:00:00:00  Wed Nov 10 10:27:49

-bash-3.2$ grep cms showq.log | grep prdcms | tail -n 10
2001747            prdcms24       Idle     1  3:00:00:00  Thu Nov 11 09:08:46
2001748            prdcms24       Idle     1  3:00:00:00  Thu Nov 11 09:08:46
2001749            prdcms24       Idle     1  3:00:00:00  Thu Nov 11 09:08:46
2001757            prdcms11       Idle     1  3:00:00:00  Thu Nov 11 09:13:02
2002095            prdcms11       Idle     1  3:00:00:00  Thu Nov 11 11:53:34
2002096            prdcms11       Idle     1  3:00:00:00  Thu Nov 11 11:53:34
2002097            prdcms11       Idle     1  3:00:00:00  Thu Nov 11 11:53:34
2002098            prdcms11       Idle     1  3:00:00:00  Thu Nov 11 11:53:34
2002541            prdcms42       Idle     1  3:00:00:00  Thu Nov 11 12:41:54
2002542            prdcms42       Idle     1  3:00:00:00  Thu Nov 11 12:41:59

Running CMS jobs run with efficiencies from 50 to 95 percent.

-bash-3.2$ grep cms showqr.log | sort -k 4 -r | head
1998539             R  lrm  95.21      1.0  -    cms485      cms            wn189     1  2:07:06:16  Wed Nov 10 20:54:13
1998536             R  lrm  95.16      1.0  -    cms430      cms            wn169     1  2:07:06:16  Wed Nov 10 20:54:13
1998631             R  lrm  95.09      1.0  -    cms485      cms            wn172     1  2:07:21:58  Wed Nov 10 21:09:55
1998161             R  lrm  94.87      1.0  -    cms139      cms            wn106     1  2:05:24:51  Wed Nov 10 19:12:48
1998596             R  lrm  94.79      1.0  -    cms034      cms            wn135     1  2:07:14:10  Wed Nov 10 21:02:07
1998533             R  lrm  94.75      1.0  -    cms352      cms            wn110     1  2:07:03:36  Wed Nov 10 20:51:33
1998582             R  lrm  94.63      1.0  -    cms034      cms            wn178     1  2:07:14:10  Wed Nov 10 21:02:07
1998595             R  lrm  94.59      1.0  -    cms034      cms            wn137     1  2:07:14:10  Wed Nov 10 21:02:07
1998532             R  lrm  94.56      1.0  -    cms023      cms            wn116     1  2:07:03:36  Wed Nov 10 20:51:33
1998630             R  lrm  94.45      1.0  -    cms485      cms            wn106     1  2:07:19:25  Wed Nov 10 21:07:22

-bash-3.2$ grep cms showqr.log | sort -k 4 -r | tail
1996470             R  lrm  56.38      1.0  -    cms352      cms            wn121     1  2:02:14:33  Wed Nov 10 16:02:30
1996390             R  lrm  55.59      1.0  -    cms430      cms            wn135     1  2:02:06:16  Wed Nov 10 15:54:13
1996750             R  lrm  54.65      1.0  -    cms034      cms            wn112     1  2:02:44:12  Wed Nov 10 16:32:09
1996466             R  lrm  54.41      1.0  -    cms430      cms            wn109     1  2:02:11:35  Wed Nov 10 15:59:32
1995825             R  lrm  54.34      1.0  -    cms485      cms            wn172     1  2:01:27:11  Wed Nov 10 15:15:08
1995824             R  lrm  54.15      1.0  -    cms430      cms            wn181     1  2:01:27:11  Wed Nov 10 15:15:08
1995826             R  lrm  53.61      1.0  -    cms485      cms            wn165     1  2:01:27:11  Wed Nov 10 15:15:08
1995823             R  lrm  53.21      1.0  -    cms430      cms            wn182     1  2:01:27:11  Wed Nov 10 15:15:08
1995821             R  lrm  52.67      1.0  -    cms485      cms            wn143     1  2:01:27:11  Wed Nov 10 15:15:08
1996375             R  lrm  47.99      1.0  -    cms034      cms            wn171     1  2:02:06:16  Wed Nov 10 15:54:13

Fairshare information

diagnose -f

FairShare Information

Depth: 7 intervals   Interval Length: 1:00:00:00   Decay Rate: 0.70

FS Policy: DEDICATEDPS
System FS Settings:  Target Usage: 0.00    Flags: 0

FSInterval        %     Target       0       1       2       3       4       5       6
FSWeight       ------- -------  1.0000  0.7000  0.4900  0.3430  0.2401  0.1681  0.1176
TotalUsage      100.00 ------- 14715.8 24937.3 14253.4    70.4 12515.5 26208.6 26610.8

...
...

GROUP
-------------
atlas            40.06 -------   25.60   40.09   33.79 -------   64.83   72.68   52.49
cms              37.82 -------   71.11   30.68   14.52   51.02    8.18   15.30   33.19
pricms*           0.00  10.00  ------- ------- ------- -------    0.00 ------- -------
prdcms           18.85  20.00  -------   27.01   45.12   47.82   26.80    6.58   12.81
dech              0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lhcb              0.00 -------    0.01    0.00    0.01 ------- -------    0.00    0.00
hone              0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11
ops               0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
lhcbplt*          3.16  18.00     3.22    2.09    6.49 -------    0.11    5.33    1.37

ACCT
-------------

QOS
-------------

CLASS
-------------
atlas            40.05  40.00    25.59   40.08   33.78 -------   64.81   72.65   52.48
cms*             56.67  40.00    71.11   57.68   59.63   98.84   34.95   21.85   45.97
lhcb*             3.17  20.00     3.23    2.09    6.50 -------    0.11    5.33    1.37
cscs              0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lcgadmin          0.03 -------    0.01    0.03    0.02 -------    0.06    0.05    0.05
ops               0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
other             0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11

CMS dashboard reports the following production job usage over the last days

time interval terminated jobs successful jobs Walltime (WrapWC)
2010-11-07 14:00:00 to 2010-11-08 14:00:00 N.A. N.A. N.A. (downtime)
2010-11-08 14:00:00 to 2010-11-09 14:00:00 602 598 17603852
2010-11-09 14:00:00 to 2010-11-10 14:00:00 1800 1785 29278990
2010-11-10 14:00:00 to 2010-11-11 14:00:00 0 0 0

Nov 11 14:11 [root@lrms02:~]# checkjob 1992979
job 1992979

AName: STDIN
State: Idle
Creds:  user:prdcms11  group:prdcms  class:cms
WallTime:   00:00:00 of 3:00:00:00
SubmitTime: Wed Nov 10 10:26:47
  (Time Queued  Total: 1:03:51:56  Eligible: 1:02:39:19)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M



Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  -1301
NOTE:  job req cannot run in partition lrms02 (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 397  feasible procs:   0

Node Rejection Summary: [Memory: 84][State: 1][Reserved: 11]



Nov 11 14:20 [root@lrms02:~]# showstart 1992979
job 1992979 requires 1 proc for 3:00:00:00

Estimated Rsv based start in                21:41:07 on Fri Nov 12 12:02:31
Estimated Rsv based completion in         3:21:41:07 on Mon Nov 15 12:02:31

Best Partition: lrms02

-- DerekFeichtinger - 2010-11-11


Arrow left Go to previous page / next page of CMS site log MOVED TO...

Edit | Attach | Watch | Print version | History: r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r1 - 2010-11-11 - DerekFeichtinger
 
  • Edit
  • Attach
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback