<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
%ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M%

---+ !!11. 11. 2010 CMS production jobs starving in the queue

%TOC%

---++ Problem formulation

The site is completely filled with uscmspilot jobs that have been running for a long time. On top of that, CRAB submissions via the normal gLite WMS are somehow filling up the queues with thousands of jobs, without regard to the information published by our information system (GlueCEStateFreeJobSlots: 0 and a very high GlueCEStateEstimatedResponseTime). <br> *Why is the WMS filling up our queues, and how could it happen that the pilot jobs were able to grab the whole site?*
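As a cross-check, the values we publish can be queried directly from the BDII. A minimal sketch, assuming an anonymous-read BDII on the standard port 2170 (the hostname is a placeholder, not our actual service name):
<pre>
ldapsearch -x -LLL -H ldap://bdii.example.ch:2170 -b o=grid \
    '(GlueCEUniqueID=*)' \
    GlueCEStateFreeJobSlots GlueCEStateWaitingJobs GlueCEStateEstimatedResponseTime
</pre>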
---++ Queue status

Qtop output for currently running jobs:
<pre>
===> Worker Nodes occupancy <===
(you can read vertically the node IDs; nodes in free state are noted with - )
|000000000111111111122222|222223333333333444444444|455555555556666666666777|777777788888888889999999|={_Worker_}|
|123456789012345678901234|567890123456789012345678|901234567890123456789012|345678901234567890123456|={__Node__}|
|------------------------|----------------o-------|------------------------|------------------------|=Node state|
|B9916BC6676671AB799696A6|4CC914996DC647AA_A911AA9|176677D17B676917DBB6C797|B167BAB7B9B6671946171966|=CPU0
|7111967977967661BC6C77A1|1176C697A4AC671A_C76A169|66611A7176717C6A1DAA11C7|97B696761777CA77B6B67711|=CPU1
|691477191777679967A676B6|71A1116B11A76161_6911461|C7A71166911BC791C7D646B7|16B19769A76AC769AC677766|=CPU2
|CCA64D19AB1B7DDCC6C6666A|BB6417A9671D169A_69A1AB6|BD9A4696716966667991C666|1C7B7BB77ADMAD1AB1767CC6|=CPU3
|769BD674616AD7CCB9A96B61|146176D7B117C716_7DBA679|A66_976977C1A67A719CCC67|9777A61DCA66B77B761B76C1|=CPU4
|9C4D_16A617661C16B116717|91761176D6766979_96AC797|A9_CD6_77177119676ABC9J7|67A6B6CB1C6AC6A1D9717476|=CPU5
|AC16A1B11C6AB196_BB61B17|1DD1996_11199711_C711B16|761A16677A41677BBC66A976|7696179776766C9C679B6A16|=CPU6
|7C677167199647191A11C16B|_61C1A1A11BA764C_7_6A_7B|B1A6979114CD7CA1CB7C6167|7B666B1DBC61916117779CAC|=CPU7
|AA4B171BA16A1CC111166641|1AB917671B1611C6_1D91B71|14C67A97DC17667111171A67|1B667676_1B179C76B767776|=CPU8
|47946D6D99DD761111BC7711|61111D67A1BD19A1_B1111DA|C1A6CB_AA69A9A1B411616B1|6A7C61BC9A6C117B7696997C|=CPU9
|DC17A4144A9D714411_4D66B|6D674166711B117C___D6616|9_B1A7_CD6B_6677C97717_9|7A167D196C6BC77117976BAD|=CPU10
|D_7117119DD14619A1D16DB1|C61161D_6161_41B_D_676D1|_6DC44_177A1__7B611BB66_|1471A6C46_16AA7771__61A6|=CPU11
|________________________|________________________|________________________|________________________|=CPU12
|________________________|________________________|________________________|________________________|=CPU13
|________________________|________________________|________________________|________________________|=CPU14
|________________________|________________________|________________________|________________________|=CPU15

===> User accounts and pool mappings <===
('all' includes those in C and W states, as reported by qstat)
id |   R +   Q  / all  | unix account | Grid certificate DN (this info is only available under elevated privileges)
 0 |   0 +1197 / 1197 | cms107    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=malbouis/cn=411828/cn=helena brandao malbouisson:cms
 1 | 231 + 489 /  721 | atlasprd  |
 2 |   0 + 443 /  443 | cms228    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=trocino/cn=668497/cn=daniele trocino:cms
 3 |   0 + 400 /  400 | prdcms42  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=vdutta/cn=696509/cn=valentina dutta:prdcms
 4 |  38 + 320 /  361 | atlasplt  |
 5 |   0 + 264 /  264 | prdcms11  | /c=be/o=begrid/ou=fynu/ou=ucl/cn=nicolas schul:prdcms
 6 | 228 +   8 /  236 | cms034    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot47/glidein-1.t2.ucsd.edu
 7 | 199 +   8 /  207 | cms485    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot49/glidein-1.t2.ucsd.edu
 8 |   0 + 200 /  200 | prdcms24  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=asciaba/cn=430796/cn=andrea sciaba:prdcms
 9 |  98 +   4 /  102 | cms352    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot10/glidein-1.t2.ucsd.edu
 A |  94 +   4 /   98 | cms430    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot07/glidein-1.t2.ucsd.edu
 B |  86 +   4 /   90 | cms139    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot05/glidein-1.t2.ucsd.edu
 C |  83 +   4 /   87 | cms023    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot09/glidein-1.t2.ucsd.edu
 D |  51 +  14 /   66 | lhcbplt01 | /dc=es/dc=irisgrid/o=ecm-ub/cn=ricardo-graciani-diaz:lhcbplt
 E |   0 +   6 /    6 | honeprd   |
 F |   0 +   5 /    5 | prdcms01  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=sciaba/cn=430796/cn=andrea sciaba:prdcms
 G |   0 +   5 /    5 | opssgm    |
 H |   0 +   5 /    5 | lhcbplt09 | /o=grid-fr/c=fr/o=cnrs/ou=cppm/cn=andrei tsaregorodtsev:lhcbplt
 I |   0 +   3 /    3 | atlassgm  |
 J |   1 +   2 /    3 | atlas107  | /o=germangrid/ou=lmu/cn=johannes elmsheuser:atlas
 K |   0 +   2 /    2 | lhcbprd01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=santinel/cn=564059/cn=roberto santinelli:lhcb
 L |   0 +   2 /    2 | cmssgm    |
 M |   1 +   0 /    1 | cms429    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=tjkim/cn=569507/cn=tae jeong kim:cms
</pre>

---++ CMS dashboard

<img src="%ATTACHURLPATH%/dashboard-20101111.png" alt="dashboard-20101111.png" width='1258' height='464' />

---++ Mail discussion excerpts

Here I found a good link to the glidein pilot job architecture: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/

*Leo Sala:*
<pre>
CSCS is the only T2 hosting the dataset (*), which was requested by almost all the jobs (~3.2k)

(*) /BpToJPsiMuMu_2MuPEtaFilter_7TeV-pythia6-evtgen/Fall10-START38_V12-v1/GEN-SIM-RECO

placement of this dataset has been done by the b-phys group (we are associated to them),
so I would expect such bursts also in the future... but not enough to really replicate data
(overall data space is quite an issue, if I've understood well)
</pre>

*Sanjay Padhi:* Before I start answering, there are two hard-coded conditions which the pilot has to follow:
   * If there is no job for the pilot to grab, it will kill itself within 1 min.
   * For a site, only 10 pilots are sent, based on the initial number of requests by the real jobs. No further pilots will be sent to the site until all 10 start running real jobs. (Derek: with 10 pilots I think he means 10 distinct pilot grid identities, i.e. appearing as 10 different local users)
   * Pilot lifetime: normally 24 hours; however, if a pilot has nothing to do (meaning there are no real jobs), it will terminate itself within 1 min.
   * Is there a unique correlation between the pilot DN and the user? No, the pilots are randomly selected for a given job.
   * Fairshare is accounted at the CRAB-server level for the real users.
   * The pilot fairshare is at the site level. Assuming all 10 pilots exhausted their share, the pilots will have lower priority at that site.
   * Note: In the last 24 hours there were > 240,000 analysis submissions to the grid; this is good, as more users are doing analysis.
*Analysis from Claudio Grandi (from multiple mails):* The decision of the gLite-WMS depends on the requirements set by the users. It is likely that either they were processing data not available at other sites, or that your site was whitelisted by the users. Obviously, if other sites were publishing an even longer wait time, your site would have been selected anyway. I look forward to the switch to CREAM, where I understand it is also possible to configure a limit on the jobs accepted in the queues... If I understand the situation correctly, there are two main points:
   * *Local fair share:* since there is no correlation between pilot DN and user DN, the local fair share at the site has no meaning for pilots. Actually, the gLite-WMS approach also assumes a FIFO within the same pool account. I already proposed to switch off the per-user fair share _within the same pool account_. This would make the gLite-WMS action much more effective and would avoid the need to use multiple DNs for pilots. There is a lot of resistance from site managers on that, but we must come back to this.
   * *Job duration:* pilot jobs normally run many user jobs (I understood from Igor that if the job slot duration is published in the BDII, the pilot continues fetching jobs up to a couple of hours before the slot expires). This is good from one point of view, but it is clear that the "inertial mass" of the system increases: once a set of pilots has "taken possession" of job slots, _if they have work to do_ they will keep them for a long time. If in the meantime jobs from other pool accounts with higher priority arrive, they have to wait until the pilots end, i.e. more or less until the end of the slot duration. The only way to optimize the situation is to keep the duration of the slot reasonably short (I don't even consider job eviction as a possibility).
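Putting the two mails together, the pilot lifecycle can be summarized in a short sketch. This is illustrative shell only, not the actual glideinWMS code; =fetch_user_job= and =run_user_job= are hypothetical helpers standing in for the real glidein machinery:
<pre>
#!/bin/bash
# Illustrative sketch of the pilot behaviour described in the mails above
# (not the real glidein code; fetch_user_job/run_user_job are hypothetical).

slot_end=$(( $(date +%s) + 24*3600 ))              # pilot lifetime: normally 24 hours

# keep fetching user jobs until ~2 hours before the slot expires
while [ $(date +%s) -lt $(( slot_end - 2*3600 )) ]; do
    job=$(fetch_user_job)                          # ask the glidein frontend for a matched user job
    if [ -z "$job" ]; then
        exit 0                                     # nothing to run: pilot kills itself within ~1 min
    fi
    run_user_job "$job"                            # any queued user job; no fixed pilot-DN/user mapping
done
</pre>
This makes the "inertial mass" point explicit: as long as =fetch_user_job= returns work, the slot is never handed back to the batch system before the walltime limit is almost reached.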
---++ Investigating scheduling and fairshare

It turns out that the most recently started jobs have been running since yesterday, around 21:40:14:
<pre>
-bash-3.2$ grep cms showq.log | grep Running | tail -n 10
1998991  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998992  cms139  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998993  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998994  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998995  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998997  cms139  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998999  cms430  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999044  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999045  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999047  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
</pre>

How long have the prdcms jobs been waiting in the queue?
<pre>
bash-3.2$ grep cms showq.log | grep prdcms | head -n 10
1992979  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992980  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992981  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992982  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992983  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:48
1992984  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:48
1992985  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:53
1992986  prdcms42  Idle  1  3:00:00:00  Wed Nov 10 10:26:56
1992987  prdcms42  Idle  1  3:00:00:00  Wed Nov 10 10:26:56
1992988  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:27:49

-bash-3.2$ grep cms showq.log | grep prdcms | tail -n 10
2001747  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001748  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001749  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001757  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 09:13:02
2002095  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002096  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002097  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002098  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002541  prdcms42  Idle  1  3:00:00:00  Thu Nov 11 12:41:54
2002542  prdcms42  Idle  1  3:00:00:00  Thu Nov 11 12:41:59
</pre>

The running CMS jobs show CPU efficiencies between roughly 50 and 95 percent:
<pre>
-bash-3.2$ grep cms showqr.log | sort -k 4 -r | head
1998539  R  lrm  95.21  1.0  -  cms485  cms  wn189  1  2:07:06:16  Wed Nov 10 20:54:13
1998536  R  lrm  95.16  1.0  -  cms430  cms  wn169  1  2:07:06:16  Wed Nov 10 20:54:13
1998631  R  lrm  95.09  1.0  -  cms485  cms  wn172  1  2:07:21:58  Wed Nov 10 21:09:55
1998161  R  lrm  94.87  1.0  -  cms139  cms  wn106  1  2:05:24:51  Wed Nov 10 19:12:48
1998596  R  lrm  94.79  1.0  -  cms034  cms  wn135  1  2:07:14:10  Wed Nov 10 21:02:07
1998533  R  lrm  94.75  1.0  -  cms352  cms  wn110  1  2:07:03:36  Wed Nov 10 20:51:33
1998582  R  lrm  94.63  1.0  -  cms034  cms  wn178  1  2:07:14:10  Wed Nov 10 21:02:07
1998595  R  lrm  94.59  1.0  -  cms034  cms  wn137  1  2:07:14:10  Wed Nov 10 21:02:07
1998532  R  lrm  94.56  1.0  -  cms023  cms  wn116  1  2:07:03:36  Wed Nov 10 20:51:33
1998630  R  lrm  94.45  1.0  -  cms485  cms  wn106  1  2:07:19:25  Wed Nov 10 21:07:22

-bash-3.2$ grep cms showqr.log | sort -k 4 -r | tail
1996470  R  lrm  56.38  1.0  -  cms352  cms  wn121  1  2:02:14:33  Wed Nov 10 16:02:30
1996390  R  lrm  55.59  1.0  -  cms430  cms  wn135  1  2:02:06:16  Wed Nov 10 15:54:13
1996750  R  lrm  54.65  1.0  -  cms034  cms  wn112  1  2:02:44:12  Wed Nov 10 16:32:09
1996466  R  lrm  54.41  1.0  -  cms430  cms  wn109  1  2:02:11:35  Wed Nov 10 15:59:32
1995825  R  lrm  54.34  1.0  -  cms485  cms  wn172  1  2:01:27:11  Wed Nov 10 15:15:08
1995824  R  lrm  54.15  1.0  -  cms430  cms  wn181  1  2:01:27:11  Wed Nov 10 15:15:08
1995826  R  lrm  53.61  1.0  -  cms485  cms  wn165  1  2:01:27:11  Wed Nov 10 15:15:08
1995823  R  lrm  53.21  1.0  -  cms430  cms  wn182  1  2:01:27:11  Wed Nov 10 15:15:08
1995821  R  lrm  52.67  1.0  -  cms485  cms  wn143  1  2:01:27:11  Wed Nov 10 15:15:08
1996375  R  lrm  47.99  1.0  -  cms034  cms  wn171  1  2:02:06:16  Wed Nov 10 15:54:13
</pre>

Fairshare information:
<pre>
diagnose -f
FairShare Information

Depth: 7 intervals   Interval Length: 1:00:00:00   Decay Rate: 0.70

FS Policy: DEDICATEDPS
System FS Settings:  Target Usage: 0.00  Flags: 0

FSInterval        %  Target       0       1       2       3       4       5       6
FSWeight    ------- -------  1.0000  0.7000  0.4900  0.3430  0.2401  0.1681  0.1176
TotalUsage   100.00 ------- 14715.8 24937.3 14253.4    70.4 12515.5 26208.6 26610.8
...
GROUP
-------------
atlas         40.06 -------   25.60   40.09   33.79 -------   64.83   72.68   52.49
cms           37.82 -------   71.11   30.68   14.52   51.02    8.18   15.30   33.19
pricms*        0.00   10.00 ------- ------- ------- -------    0.00 ------- -------
prdcms        18.85   20.00 -------   27.01   45.12   47.82   26.80    6.58   12.81
dech           0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lhcb           0.00 -------    0.01    0.00    0.01 ------- -------    0.00    0.00
hone           0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11
ops            0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
lhcbplt*       3.16   18.00    3.22    2.09    6.49 -------    0.11    5.33    1.37
ACCT
-------------
QOS
-------------
CLASS
-------------
atlas         40.05   40.00   25.59   40.08   33.78 -------   64.81   72.65   52.48
cms*          56.67   40.00   71.11   57.68   59.63   98.84   34.95   21.85   45.97
lhcb*          3.17   20.00    3.23    2.09    6.50 -------    0.11    5.33    1.37
cscs           0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lcgadmin       0.03 -------    0.01    0.03    0.02 -------    0.06    0.05    0.05
ops            0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
other          0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11
</pre>
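For reference, the policy settings visible in this output correspond to maui.cfg parameters roughly as in the following sketch. The parameter names are standard Maui options and the values are read off the diagnose output above; the actual layout of our site configuration file may differ:
<pre>
# maui.cfg sketch consistent with the "diagnose -f" output above
# (values read off the output; actual site config may differ)
FSPOLICY           DEDICATEDPS
FSDEPTH            7
FSINTERVAL         24:00:00
FSDECAY            0.70

GROUPCFG[pricms]   FSTARGET=10
GROUPCFG[prdcms]   FSTARGET=20
GROUPCFG[lhcbplt]  FSTARGET=18

CLASSCFG[atlas]    FSTARGET=40
CLASSCFG[cms]      FSTARGET=40
CLASSCFG[lhcb]     FSTARGET=20
</pre>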
The CMS dashboard reports the following production job usage over the last days:

| *time interval* | *terminated jobs* | *successful jobs* | *Walltime (WrapWC)* |
| 2010-11-07 14:00:00 to 2010-11-08 14:00:00 | N.A. | N.A. | N.A. (downtime) |
| 2010-11-08 14:00:00 to 2010-11-09 14:00:00 | 602 | 598 | 17603852 |
| 2010-11-09 14:00:00 to 2010-11-10 14:00:00 | 1800 | 1785 | 29278990 |
| 2010-11-10 14:00:00 to 2010-11-11 14:00:00 | 0 | 0 | 0 |

Checking that the fairshare calculation for prdcms is correct (the FS usage percentages and TotalUsage values are taken from the diagnose -f output above):
<pre>
(0*pow(0.7,0)*14715.8 + 27.01*pow(0.7,1)*24937.3 + 45.12*pow(0.7,2)*14253.4
 + 47.82*pow(0.7,3)*70.4 + 26.80*pow(0.7,4)*12515.5 + 6.58*pow(0.7,5)*26208.6
 + 12.81*pow(0.7,6)*26610.8)
 /
(pow(0.7,0)*14715.8 + pow(0.7,1)*24937.3 + pow(0.7,2)*14253.4 + pow(0.7,3)*70.4
 + pow(0.7,4)*12515.5 + pow(0.7,5)*26208.6 + pow(0.7,6)*26610.8)
= 18.853110024838912
</pre>
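The same arithmetic as a reproducible one-liner (plain awk; the numbers are copied from the diagnose -f output above, nothing else is assumed):
<pre>
-bash-3.2$ awk 'BEGIN {
    split("0 27.01 45.12 47.82 26.80 6.58 12.81", p);                  # prdcms FS usage per interval (%)
    split("14715.8 24937.3 14253.4 70.4 12515.5 26208.6 26610.8", u);  # TotalUsage per interval
    for (i = 1; i <= 7; i++) { w = 0.7^(i-1); num += p[i]*w*u[i]; den += w*u[i] }
    printf "%.4f\n", num/den }'
18.8531
</pre>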
Trying to find out more about the first queued job:
<pre>
Nov 11 14:11 [root@lrms02:~]# checkjob 1992979
job 1992979

AName: STDIN
State: Idle
Creds:  user:prdcms11  group:prdcms  class:cms
WallTime:   00:00:00 of 3:00:00:00
SubmitTime: Wed Nov 10 10:26:47
  (Time Queued  Total: 1:03:51:56  Eligible: 1:02:39:19)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M

Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  -1301
NOTE:  job req cannot run in partition lrms02 (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 397  feasible procs: 0

Node Rejection Summary: [Memory: 84][State: 1][Reserved: 11]

Nov 11 14:20 [root@lrms02:~]# showstart 1992979
job 1992979 requires 1 proc for 3:00:00:00

Estimated Rsv based start in        21:41:07 on Fri Nov 12 12:02:31
Estimated Rsv based completion in 3:21:41:07 on Mon Nov 15 12:02:31

Best Partition: lrms02
</pre>

Note that the job carries the FSVIOLATION flag and a negative StartPriority, and that 84 candidate nodes are rejected on memory (the job requests 2000M per task).

-- Main.DerekFeichtinger - 2010-11-11

----------------
%ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M%