<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
%ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M%

---+ !!11. 11. 2010 CMS production jobs starving in the queue

%TOC%

---++ Problem formulation

The site is completely filled with uscmspilot jobs that have been running for a long time. On top of that, CRAB submissions via the normal gLite WMS are somehow filling up the queues with thousands of jobs, without regard to the information published by our information system (GlueCEStateFreeJobSlots: 0 and a very high GlueCEStateEstimatedResponseTime). <br> *Why is the WMS filling up our queues, and how could it happen that the pilot jobs were able to grab the whole site?*
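As a cross-check, the values we publish can be queried directly from the BDII. A minimal sketch, assuming an anonymous-read BDII on the standard port 2170 (the hostname is a placeholder, not our actual service name):
<pre>
ldapsearch -x -LLL -H ldap://bdii.example.ch:2170 -b o=grid \
    '(GlueCEUniqueID=*)' \
    GlueCEStateFreeJobSlots GlueCEStateWaitingJobs GlueCEStateEstimatedResponseTime
</pre>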
---++ Queue status

Qtop output for currently running jobs:
<pre>
===> Worker Nodes occupancy <===
(you can read vertically the node IDs; nodes in free state are noted with - )
|000000000111111111122222|222223333333333444444444|455555555556666666666777|777777788888888889999999|={_Worker_}|
|123456789012345678901234|567890123456789012345678|901234567890123456789012|345678901234567890123456|={__Node__}|
|------------------------|----------------o-------|------------------------|------------------------|=Node state|
|B9916BC6676671AB799696A6|4CC914996DC647AA_A911AA9|176677D17B676917DBB6C797|B167BAB7B9B6671946171966|=CPU0
|7111967977967661BC6C77A1|1176C697A4AC671A_C76A169|66611A7176717C6A1DAA11C7|97B696761777CA77B6B67711|=CPU1
|691477191777679967A676B6|71A1116B11A76161_6911461|C7A71166911BC791C7D646B7|16B19769A76AC769AC677766|=CPU2
|CCA64D19AB1B7DDCC6C6666A|BB6417A9671D169A_69A1AB6|BD9A4696716966667991C666|1C7B7BB77ADMAD1AB1767CC6|=CPU3
|769BD674616AD7CCB9A96B61|146176D7B117C716_7DBA679|A66_976977C1A67A719CCC67|9777A61DCA66B77B761B76C1|=CPU4
|9C4D_16A617661C16B116717|91761176D6766979_96AC797|A9_CD6_77177119676ABC9J7|67A6B6CB1C6AC6A1D9717476|=CPU5
|AC16A1B11C6AB196_BB61B17|1DD1996_11199711_C711B16|761A16677A41677BBC66A976|7696179776766C9C679B6A16|=CPU6
|7C677167199647191A11C16B|_61C1A1A11BA764C_7_6A_7B|B1A6979114CD7CA1CB7C6167|7B666B1DBC61916117779CAC|=CPU7
|AA4B171BA16A1CC111166641|1AB917671B1611C6_1D91B71|14C67A97DC17667111171A67|1B667676_1B179C76B767776|=CPU8
|47946D6D99DD761111BC7711|61111D67A1BD19A1_B1111DA|C1A6CB_AA69A9A1B411616B1|6A7C61BC9A6C117B7696997C|=CPU9
|DC17A4144A9D714411_4D66B|6D674166711B117C___D6616|9_B1A7_CD6B_6677C97717_9|7A167D196C6BC77117976BAD|=CPU10
|D_7117119DD14619A1D16DB1|C61161D_6161_41B_D_676D1|_6DC44_177A1__7B611BB66_|1471A6C46_16AA7771__61A6|=CPU11
|________________________|________________________|________________________|________________________|=CPU12
|________________________|________________________|________________________|________________________|=CPU13
|________________________|________________________|________________________|________________________|=CPU14
|________________________|________________________|________________________|________________________|=CPU15

===> User accounts and pool mappings <===
('all' includes those in C and W states, as reported by qstat)
id |   R +   Q  / all  | unix account | Grid certificate DN (this info is only available under elevated privileges)
 0 |   0 +1197 / 1197 | cms107    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=malbouis/cn=411828/cn=helena brandao malbouisson:cms
 1 | 231 + 489 /  721 | atlasprd  |
 2 |   0 + 443 /  443 | cms228    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=trocino/cn=668497/cn=daniele trocino:cms
 3 |   0 + 400 /  400 | prdcms42  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=vdutta/cn=696509/cn=valentina dutta:prdcms
 4 |  38 + 320 /  361 | atlasplt  |
 5 |   0 + 264 /  264 | prdcms11  | /c=be/o=begrid/ou=fynu/ou=ucl/cn=nicolas schul:prdcms
 6 | 228 +   8 /  236 | cms034    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot47/glidein-1.t2.ucsd.edu
 7 | 199 +   8 /  207 | cms485    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot49/glidein-1.t2.ucsd.edu
 8 |   0 + 200 /  200 | prdcms24  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=asciaba/cn=430796/cn=andrea sciaba:prdcms
 9 |  98 +   4 /  102 | cms352    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot10/glidein-1.t2.ucsd.edu
 A |  94 +   4 /   98 | cms430    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot07/glidein-1.t2.ucsd.edu
 B |  86 +   4 /   90 | cms139    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot05/glidein-1.t2.ucsd.edu
 C |  83 +   4 /   87 | cms023    | /dc=org/dc=doegrids/ou=services/cn=uscmspilot09/glidein-1.t2.ucsd.edu
 D |  51 +  14 /   66 | lhcbplt01 | /dc=es/dc=irisgrid/o=ecm-ub/cn=ricardo-graciani-diaz:lhcbplt
 E |   0 +   6 /    6 | honeprd   |
 F |   0 +   5 /    5 | prdcms01  | /dc=ch/dc=cern/ou=organic units/ou=users/cn=sciaba/cn=430796/cn=andrea sciaba:prdcms
 G |   0 +   5 /    5 | opssgm    |
 H |   0 +   5 /    5 | lhcbplt09 | /o=grid-fr/c=fr/o=cnrs/ou=cppm/cn=andrei tsaregorodtsev:lhcbplt
 I |   0 +   3 /    3 | atlassgm  |
 J |   1 +   2 /    3 | atlas107  | /o=germangrid/ou=lmu/cn=johannes elmsheuser:atlas
 K |   0 +   2 /    2 | lhcbprd01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=santinel/cn=564059/cn=roberto santinelli:lhcb
 L |   0 +   2 /    2 | cmssgm    |
 M |   1 +   0 /    1 | cms429    | /dc=ch/dc=cern/ou=organic units/ou=users/cn=tjkim/cn=569507/cn=tae jeong kim:cms
</pre>

---++ CMS dashboard

<img src="%ATTACHURLPATH%/dashboard-20101111.png" alt="dashboard-20101111.png" width='1258' height='464' />

---++ Mail discussion excerpts

Here I found a good link to the glidein pilot job architecture: http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/

*Leo Sala:*
<pre>
CSCS is the only T2 hosting the dataset (*), which was requested by almost all the jobs (~3.2k)

(*) /BpToJPsiMuMu_2MuPEtaFilter_7TeV-pythia6-evtgen/Fall10-START38_V12-v1/GEN-SIM-RECO

placement of this dataset has been done by the b-phys group (we are associated to them),
so I would expect such bursts also in the future... but not enough to really replicate data
(overall data space is quite an issue, if I've understood well)
</pre>

*Sanjay Padhi:* Before I start answering, there are two hard-coded conditions which the pilot has to follow:
   * If there is no job for the pilot to grab, it will kill itself within 1 min.
   * For a site, only 10 pilots are sent, based on the initial number of requests by the real jobs. No further pilots will be sent to the site until all 10 start running real jobs. (Derek: with 10 pilots I think he means 10 distinct pilot grid identities, i.e. appearing as 10 different local users)
   * Pilot lifetime: normally 24 hours; however, if a pilot has nothing to do (meaning there are no real jobs), it will terminate itself within 1 min.
   * Is there a unique correlation between the pilot DN and the user? No, the pilots are randomly selected for a given job.
   * Fairshare is accounted at the CRAB-server level for the real users.
   * The pilot fairshare is at the site level. Assuming all 10 pilots exhausted their share, the pilots will have lower priority at that site.
   * Note: In the last 24 hours there were > 240,000 analysis submissions to the grid; this is good, as more users are doing analysis.
*Analysis from Claudio Grandi (from multiple mails):* The decision of the gLite-WMS depends on the requirements set by the users. It is likely that either they were processing data not available at other sites, or that your site was whitelisted by the users. Obviously, if other sites were publishing an even longer wait time, your site would have been selected anyway. I look forward to the switch to CREAM, where I understand it is also possible to configure a limit on the jobs accepted in the queues... If I understand the situation correctly, there are two main points:
   * *Local fair share:* since there is no correlation between pilot DN and user DN, the local fair share at the site has no meaning for pilots. Actually, the gLite-WMS approach also assumes a FIFO within the same pool account. I already proposed to switch off the per-user fair share _within the same pool account_. This would make the gLite-WMS action much more effective and would avoid the need to use multiple DNs for pilots. There is a lot of resistance from site managers on that, but we must come back to this.
   * *Job duration:* pilot jobs normally run many user jobs (I understood from Igor that if the job slot duration is published in the BDII, the pilot continues fetching jobs up to a couple of hours before the slot expires). This is good from one point of view, but it is clear that the "inertial mass" of the system increases: once a set of pilots has "taken possession" of job slots, _if they have work to do_ they will keep them for a long time. If in the meantime jobs from other pool accounts with higher priority arrive, they have to wait until the pilots end, i.e. more or less until the end of the slot duration. The only way to optimize the situation is to keep the duration of the slot reasonably short (I don't even consider job eviction as a possibility).
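Putting the two mails together, the pilot lifecycle can be summarized in a short sketch. This is illustrative shell only, not the actual glideinWMS code; =fetch_user_job= and =run_user_job= are hypothetical helpers standing in for the real glidein machinery:
<pre>
#!/bin/bash
# Illustrative sketch of the pilot behaviour described in the mails above
# (not the real glidein code; fetch_user_job/run_user_job are hypothetical).

slot_end=$(( $(date +%s) + 24*3600 ))              # pilot lifetime: normally 24 hours

# keep fetching user jobs until ~2 hours before the slot expires
while [ $(date +%s) -lt $(( slot_end - 2*3600 )) ]; do
    job=$(fetch_user_job)                          # ask the glidein frontend for a matched user job
    if [ -z "$job" ]; then
        exit 0                                     # nothing to run: pilot kills itself within ~1 min
    fi
    run_user_job "$job"                            # any queued user job; no fixed pilot-DN/user mapping
done
</pre>
This makes the "inertial mass" point explicit: as long as =fetch_user_job= returns work, the slot is never handed back to the batch system before the walltime limit is almost reached.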
---++ Investigating scheduling and fairshare

It turns out that the most recently started jobs have been running since yesterday, around 21:40:14:
<pre>
-bash-3.2$ grep cms showq.log | grep Running | tail -n 10
1998991  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998992  cms139  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998993  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998994  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998995  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998997  cms139  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1998999  cms430  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999044  cms034  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999045  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
1999047  cms485  Running  1  2:07:58:48  Wed Nov 10 21:40:14
</pre>

How long have the prdcms jobs been waiting in the queue?
<pre>
bash-3.2$ grep cms showq.log | grep prdcms | head -n 10
1992979  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992980  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992981  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992982  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:47
1992983  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:48
1992984  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:48
1992985  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:26:53
1992986  prdcms42  Idle  1  3:00:00:00  Wed Nov 10 10:26:56
1992987  prdcms42  Idle  1  3:00:00:00  Wed Nov 10 10:26:56
1992988  prdcms11  Idle  1  3:00:00:00  Wed Nov 10 10:27:49

-bash-3.2$ grep cms showq.log | grep prdcms | tail -n 10
2001747  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001748  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001749  prdcms24  Idle  1  3:00:00:00  Thu Nov 11 09:08:46
2001757  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 09:13:02
2002095  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002096  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002097  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002098  prdcms11  Idle  1  3:00:00:00  Thu Nov 11 11:53:34
2002541  prdcms42  Idle  1  3:00:00:00  Thu Nov 11 12:41:54
2002542  prdcms42  Idle  1  3:00:00:00  Thu Nov 11 12:41:59
</pre>

The running CMS jobs show CPU efficiencies between roughly 50 and 95 percent:
<pre>
-bash-3.2$ grep cms showqr.log | sort -k 4 -r | head
1998539  R  lrm  95.21  1.0  -  cms485  cms  wn189  1  2:07:06:16  Wed Nov 10 20:54:13
1998536  R  lrm  95.16  1.0  -  cms430  cms  wn169  1  2:07:06:16  Wed Nov 10 20:54:13
1998631  R  lrm  95.09  1.0  -  cms485  cms  wn172  1  2:07:21:58  Wed Nov 10 21:09:55
1998161  R  lrm  94.87  1.0  -  cms139  cms  wn106  1  2:05:24:51  Wed Nov 10 19:12:48
1998596  R  lrm  94.79  1.0  -  cms034  cms  wn135  1  2:07:14:10  Wed Nov 10 21:02:07
1998533  R  lrm  94.75  1.0  -  cms352  cms  wn110  1  2:07:03:36  Wed Nov 10 20:51:33
1998582  R  lrm  94.63  1.0  -  cms034  cms  wn178  1  2:07:14:10  Wed Nov 10 21:02:07
1998595  R  lrm  94.59  1.0  -  cms034  cms  wn137  1  2:07:14:10  Wed Nov 10 21:02:07
1998532  R  lrm  94.56  1.0  -  cms023  cms  wn116  1  2:07:03:36  Wed Nov 10 20:51:33
1998630  R  lrm  94.45  1.0  -  cms485  cms  wn106  1  2:07:19:25  Wed Nov 10 21:07:22

-bash-3.2$ grep cms showqr.log | sort -k 4 -r | tail
1996470  R  lrm  56.38  1.0  -  cms352  cms  wn121  1  2:02:14:33  Wed Nov 10 16:02:30
1996390  R  lrm  55.59  1.0  -  cms430  cms  wn135  1  2:02:06:16  Wed Nov 10 15:54:13
1996750  R  lrm  54.65  1.0  -  cms034  cms  wn112  1  2:02:44:12  Wed Nov 10 16:32:09
1996466  R  lrm  54.41  1.0  -  cms430  cms  wn109  1  2:02:11:35  Wed Nov 10 15:59:32
1995825  R  lrm  54.34  1.0  -  cms485  cms  wn172  1  2:01:27:11  Wed Nov 10 15:15:08
1995824  R  lrm  54.15  1.0  -  cms430  cms  wn181  1  2:01:27:11  Wed Nov 10 15:15:08
1995826  R  lrm  53.61  1.0  -  cms485  cms  wn165  1  2:01:27:11  Wed Nov 10 15:15:08
1995823  R  lrm  53.21  1.0  -  cms430  cms  wn182  1  2:01:27:11  Wed Nov 10 15:15:08
1995821  R  lrm  52.67  1.0  -  cms485  cms  wn143  1  2:01:27:11  Wed Nov 10 15:15:08
1996375  R  lrm  47.99  1.0  -  cms034  cms  wn171  1  2:02:06:16  Wed Nov 10 15:54:13
</pre>

Fairshare information:
<pre>
diagnose -f
FairShare Information

Depth: 7 intervals   Interval Length: 1:00:00:00   Decay Rate: 0.70

FS Policy: DEDICATEDPS
System FS Settings:  Target Usage: 0.00  Flags: 0

FSInterval        %  Target       0       1       2       3       4       5       6
FSWeight    ------- -------  1.0000  0.7000  0.4900  0.3430  0.2401  0.1681  0.1176
TotalUsage   100.00 ------- 14715.8 24937.3 14253.4    70.4 12515.5 26208.6 26610.8
...
GROUP
-------------
atlas         40.06 -------   25.60   40.09   33.79 -------   64.83   72.68   52.49
cms           37.82 -------   71.11   30.68   14.52   51.02    8.18   15.30   33.19
pricms*        0.00   10.00 ------- ------- ------- -------    0.00 ------- -------
prdcms        18.85   20.00 -------   27.01   45.12   47.82   26.80    6.58   12.81
dech           0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lhcb           0.00 -------    0.01    0.00    0.01 ------- -------    0.00    0.00
hone           0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11
ops            0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
lhcbplt*       3.16   18.00    3.22    2.09    6.49 -------    0.11    5.33    1.37
ACCT
-------------
QOS
-------------
CLASS
-------------
atlas         40.05   40.00   25.59   40.08   33.78 -------   64.81   72.65   52.48
cms*          56.67   40.00   71.11   57.68   59.63   98.84   34.95   21.85   45.97
lhcb*          3.17   20.00    3.23    2.09    6.50 -------    0.11    5.33    1.37
cscs           0.00 ------- ------- -------    0.00 ------- ------- ------- -------
lcgadmin       0.03 -------    0.01    0.03    0.02 -------    0.06    0.05    0.05
ops            0.01 -------    0.00    0.01    0.01    1.16    0.03    0.02    0.03
other          0.08 -------    0.06    0.11    0.07 -------    0.04    0.10    0.11
</pre>
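For reference, the policy settings visible in this output correspond to maui.cfg parameters roughly as in the following sketch. The parameter names are standard Maui options and the values are read off the diagnose output above; the actual layout of our site configuration file may differ:
<pre>
# maui.cfg sketch consistent with the "diagnose -f" output above
# (values read off the output; actual site config may differ)
FSPOLICY           DEDICATEDPS
FSDEPTH            7
FSINTERVAL         24:00:00
FSDECAY            0.70

GROUPCFG[pricms]   FSTARGET=10
GROUPCFG[prdcms]   FSTARGET=20
GROUPCFG[lhcbplt]  FSTARGET=18

CLASSCFG[atlas]    FSTARGET=40
CLASSCFG[cms]      FSTARGET=40
CLASSCFG[lhcb]     FSTARGET=20
</pre>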
The CMS dashboard reports the following production job usage over the last days:

| *time interval* | *terminated jobs* | *successful jobs* | *Walltime (WrapWC)* |
| 2010-11-07 14:00:00 to 2010-11-08 14:00:00 | N.A. | N.A. | N.A. (downtime) |
| 2010-11-08 14:00:00 to 2010-11-09 14:00:00 | 602 | 598 | 17603852 |
| 2010-11-09 14:00:00 to 2010-11-10 14:00:00 | 1800 | 1785 | 29278990 |
| 2010-11-10 14:00:00 to 2010-11-11 14:00:00 | 0 | 0 | 0 |

Checking that the fairshare calculation for prdcms is correct (the FS usage percentages and TotalUsage values are taken from the diagnose -f output above):
<pre>
(0*pow(0.7,0)*14715.8 + 27.01*pow(0.7,1)*24937.3 + 45.12*pow(0.7,2)*14253.4
 + 47.82*pow(0.7,3)*70.4 + 26.80*pow(0.7,4)*12515.5 + 6.58*pow(0.7,5)*26208.6
 + 12.81*pow(0.7,6)*26610.8)
 /
(pow(0.7,0)*14715.8 + pow(0.7,1)*24937.3 + pow(0.7,2)*14253.4 + pow(0.7,3)*70.4
 + pow(0.7,4)*12515.5 + pow(0.7,5)*26208.6 + pow(0.7,6)*26610.8)
= 18.853110024838912
</pre>
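The same arithmetic as a reproducible one-liner (plain awk; the numbers are copied from the diagnose -f output above, nothing else is assumed):
<pre>
-bash-3.2$ awk 'BEGIN {
    split("0 27.01 45.12 47.82 26.80 6.58 12.81", p);                  # prdcms FS usage per interval (%)
    split("14715.8 24937.3 14253.4 70.4 12515.5 26208.6 26610.8", u);  # TotalUsage per interval
    for (i = 1; i <= 7; i++) { w = 0.7^(i-1); num += p[i]*w*u[i]; den += w*u[i] }
    printf "%.4f\n", num/den }'
18.8531
</pre>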
Trying to find out more about the first queued job:
<pre>
Nov 11 14:11 [root@lrms02:~]# checkjob 1992979
job 1992979

AName: STDIN
State: Idle
Creds:  user:prdcms11  group:prdcms  class:cms
WallTime:   00:00:00 of 3:00:00:00
SubmitTime: Wed Nov 10 10:26:47
  (Time Queued  Total: 1:03:51:56  Eligible: 1:02:39:19)

Total Requested Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Dedicated Resources Per Task: PROCS: 1  MEM: 2000M

Flags:          RESTARTABLE,FSVIOLATION
Attr:           FSVIOLATION,checkpoint
StartPriority:  -1301
NOTE:  job req cannot run in partition lrms02 (available procs do not meet requirements : 0 of 1 procs found)
idle procs: 397  feasible procs: 0

Node Rejection Summary: [Memory: 84][State: 1][Reserved: 11]

Nov 11 14:20 [root@lrms02:~]# showstart 1992979
job 1992979 requires 1 proc for 3:00:00:00

Estimated Rsv based start in        21:41:07 on Fri Nov 12 12:02:31
Estimated Rsv based completion in 3:21:41:07 on Mon Nov 15 12:02:31

Best Partition: lrms02
</pre>

Note that the job carries the FSVIOLATION flag and a negative StartPriority, and that 84 candidate nodes are rejected on memory (the job requests 2000M per task).

-- Main.DerekFeichtinger - 2010-11-11

----------------
%ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M%