Tags:
view all tags
<!-- keep this as a security measure: * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup #uncomment this if you want the page only be viewable by the internal people # * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup --> %TOC% %ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M% ---+ 11. 11. 2010 CMS production jobs starving in the queue Qtop output for currently running jobs: <pre> ===> Worker Nodes occupancy <=== (you can read vertically the node IDs; nodes in free state are noted with - ) |000000000111111111122222|222223333333333444444444|455555555556666666666777|777777788888888889999999|={_Worker_}| |123456789012345678901234|567890123456789012345678|901234567890123456789012|345678901234567890123456|={__Node__}| |------------------------|----------------o-------|------------------------|------------------------|=Node state| |B9916BC6676671AB799696A6|4CC914996DC647AA_A911AA9|176677D17B676917DBB6C797|B167BAB7B9B6671946171966|=CPU0 |7111967977967661BC6C77A1|1176C697A4AC671A_C76A169|66611A7176717C6A1DAA11C7|97B696761777CA77B6B67711|=CPU1 |691477191777679967A676B6|71A1116B11A76161_6911461|C7A71166911BC791C7D646B7|16B19769A76AC769AC677766|=CPU2 |CCA64D19AB1B7DDCC6C6666A|BB6417A9671D169A_69A1AB6|BD9A4696716966667991C666|1C7B7BB77ADMAD1AB1767CC6|=CPU3 |769BD674616AD7CCB9A96B61|146176D7B117C716_7DBA679|A66_976977C1A67A719CCC67|9777A61DCA66B77B761B76C1|=CPU4 |9C4D_16A617661C16B116717|91761176D6766979_96AC797|A9_CD6_77177119676ABC9J7|67A6B6CB1C6AC6A1D9717476|=CPU5 |AC16A1B11C6AB196_BB61B17|1DD1996_11199711_C711B16|761A16677A41677BBC66A976|7696179776766C9C679B6A16|=CPU6 |7C677167199647191A11C16B|_61C1A1A11BA764C_7_6A_7B|B1A6979114CD7CA1CB7C6167|7B666B1DBC61916117779CAC|=CPU7 |AA4B171BA16A1CC111166641|1AB917671B1611C6_1D91B71|14C67A97DC17667111171A67|1B667676_1B179C76B767776|=CPU8 |47946D6D99DD761111BC7711|61111D67A1BD19A1_B1111DA|C1A6CB_AA69A9A1B411616B1|6A7C61BC9A6C117B7696997C|=CPU9 |DC17A4144A9D714411_4D66B|6D674166711B117C___D6616|9_B1A7_CD6B_6677C97717_9|7A167D196C6BC77117976BAD|=CPU10 |D_7117119DD14619A1D16DB1|C61161D_6161_41B_D_676D1|_6DC44_177A1__7B611BB66_|1471A6C46_16AA7771__61A6|=CPU11 |________________________|________________________|________________________|________________________|=CPU12 |________________________|________________________|________________________|________________________|=CPU13 |________________________|________________________|________________________|________________________|=CPU14 |________________________|________________________|________________________|________________________|=CPU15 ===> User accounts and pool mappings <=== ('all' includes those in C and W states, as reported by qstat) id| R + Q / all | unix account | Grid certificate DN (this info is only available under elevated privileges) 0 | 0 +1197 /1197 | cms107 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=malbouis/cn=411828/cn=helena brandao malbouisson:cms 1 |231 +489 / 721 | atlasprd | 2 | 0 +443 / 443 | cms228 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=trocino/cn=668497/cn=daniele trocino:cms 3 | 0 +400 / 400 | prdcms42 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=vdutta/cn=696509/cn=valentina dutta:prdcms 4 | 38 +320 / 361 | atlasplt | 5 | 0 +264 / 264 | prdcms11 | /c=be/o=begrid/ou=fynu/ou=ucl/cn=nicolas schul:prdcms 6 |228 + 8 / 236 | cms034 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot47/glidein-1.t2.ucsd.edu 7 |199 + 8 / 207 | cms485 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot49/glidein-1.t2.ucsd.edu 8 | 0 +200 / 200 | prdcms24 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=asciaba/cn=430796/cn=andrea sciaba:prdcms 9 | 98 + 4 / 102 | cms352 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot10/glidein-1.t2.ucsd.edu A | 94 + 4 / 98 | cms430 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot07/glidein-1.t2.ucsd.edu B | 86 + 4 / 90 | cms139 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot05/glidein-1.t2.ucsd.edu C | 83 + 4 / 87 | cms023 | /dc=org/dc=doegrids/ou=services/cn=uscmspilot09/glidein-1.t2.ucsd.edu D | 51 + 14 / 66 | lhcbplt01 | /dc=es/dc=irisgrid/o=ecm-ub/cn=ricardo-graciani-diaz:lhcbplt E | 0 + 6 / 6 | honeprd | F | 0 + 5 / 5 | prdcms01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=sciaba/cn=430796/cn=andrea sciaba:prdcms G | 0 + 5 / 5 | opssgm | H | 0 + 5 / 5 | lhcbplt09 | /o=grid-fr/c=fr/o=cnrs/ou=cppm/cn=andrei tsaregorodtsev:lhcbplt I | 0 + 3 / 3 | atlassgm | J | 1 + 2 / 3 | atlas107 | /o=germangrid/ou=lmu/cn=johannes elmsheuser:atlas K | 0 + 2 / 2 | lhcbprd01 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=santinel/cn=564059/cn=roberto santinelli:lhcb L | 0 + 2 / 2 | cmssgm | M | 1 + 0 / 1 | cms429 | /dc=ch/dc=cern/ou=organic units/ou=users/cn=tjkim/cn=569507/cn=tae jeong kim:cms </pre> It turns out that the most recent current jobs have been running since yesterday around 21:40:14 <pre> -bash-3.2$ grep cms showq.log | grep Running | tail -n 10 1998991 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998992 cms139 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998993 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998994 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998995 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998997 cms139 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1998999 cms430 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1999044 cms034 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1999045 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14 1999047 cms485 Running 1 2:07:58:48 Wed Nov 10 21:40:14 </pre> How long have the prdcms jobs been waiting in the queue: <pre> bash-3.2$ grep cms showq.log | grep prdcms | head -n 10 1992979 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47 1992980 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47 1992981 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47 1992982 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:47 1992983 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:48 1992984 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:48 1992985 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:26:53 1992986 prdcms42 Idle 1 3:00:00:00 Wed Nov 10 10:26:56 1992987 prdcms42 Idle 1 3:00:00:00 Wed Nov 10 10:26:56 1992988 prdcms11 Idle 1 3:00:00:00 Wed Nov 10 10:27:49 -bash-3.2$ grep cms showq.log | grep prdcms | tail -n 10 2001747 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46 2001748 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46 2001749 prdcms24 Idle 1 3:00:00:00 Thu Nov 11 09:08:46 2001757 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 09:13:02 2002095 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34 2002096 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34 2002097 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34 2002098 prdcms11 Idle 1 3:00:00:00 Thu Nov 11 11:53:34 2002541 prdcms42 Idle 1 3:00:00:00 Thu Nov 11 12:41:54 2002542 prdcms42 Idle 1 3:00:00:00 Thu Nov 11 12:41:59 </pre> Running CMS jobs run with efficiencies from 50 to 95 percent. <pre> -bash-3.2$ grep cms showqr.log | sort -k 4 -r | head 1998539 R lrm 95.21 1.0 - cms485 cms wn189 1 2:07:06:16 Wed Nov 10 20:54:13 1998536 R lrm 95.16 1.0 - cms430 cms wn169 1 2:07:06:16 Wed Nov 10 20:54:13 1998631 R lrm 95.09 1.0 - cms485 cms wn172 1 2:07:21:58 Wed Nov 10 21:09:55 1998161 R lrm 94.87 1.0 - cms139 cms wn106 1 2:05:24:51 Wed Nov 10 19:12:48 1998596 R lrm 94.79 1.0 - cms034 cms wn135 1 2:07:14:10 Wed Nov 10 21:02:07 1998533 R lrm 94.75 1.0 - cms352 cms wn110 1 2:07:03:36 Wed Nov 10 20:51:33 1998582 R lrm 94.63 1.0 - cms034 cms wn178 1 2:07:14:10 Wed Nov 10 21:02:07 1998595 R lrm 94.59 1.0 - cms034 cms wn137 1 2:07:14:10 Wed Nov 10 21:02:07 1998532 R lrm 94.56 1.0 - cms023 cms wn116 1 2:07:03:36 Wed Nov 10 20:51:33 1998630 R lrm 94.45 1.0 - cms485 cms wn106 1 2:07:19:25 Wed Nov 10 21:07:22 -bash-3.2$ grep cms showqr.log | sort -k 4 -r | tail 1996470 R lrm 56.38 1.0 - cms352 cms wn121 1 2:02:14:33 Wed Nov 10 16:02:30 1996390 R lrm 55.59 1.0 - cms430 cms wn135 1 2:02:06:16 Wed Nov 10 15:54:13 1996750 R lrm 54.65 1.0 - cms034 cms wn112 1 2:02:44:12 Wed Nov 10 16:32:09 1996466 R lrm 54.41 1.0 - cms430 cms wn109 1 2:02:11:35 Wed Nov 10 15:59:32 1995825 R lrm 54.34 1.0 - cms485 cms wn172 1 2:01:27:11 Wed Nov 10 15:15:08 1995824 R lrm 54.15 1.0 - cms430 cms wn181 1 2:01:27:11 Wed Nov 10 15:15:08 1995826 R lrm 53.61 1.0 - cms485 cms wn165 1 2:01:27:11 Wed Nov 10 15:15:08 1995823 R lrm 53.21 1.0 - cms430 cms wn182 1 2:01:27:11 Wed Nov 10 15:15:08 1995821 R lrm 52.67 1.0 - cms485 cms wn143 1 2:01:27:11 Wed Nov 10 15:15:08 1996375 R lrm 47.99 1.0 - cms034 cms wn171 1 2:02:06:16 Wed Nov 10 15:54:13 </pre> Fairshare information <pre> diagnose -f FairShare Information Depth: 7 intervals Interval Length: 1:00:00:00 Decay Rate: 0.70 FS Policy: DEDICATEDPS System FS Settings: Target Usage: 0.00 Flags: 0 FSInterval % Target 0 1 2 3 4 5 6 FSWeight ------- ------- 1.0000 0.7000 0.4900 0.3430 0.2401 0.1681 0.1176 TotalUsage 100.00 ------- 14715.8 24937.3 14253.4 70.4 12515.5 26208.6 26610.8 ... ... GROUP ------------- atlas 40.06 ------- 25.60 40.09 33.79 ------- 64.83 72.68 52.49 cms 37.82 ------- 71.11 30.68 14.52 51.02 8.18 15.30 33.19 pricms* 0.00 10.00 ------- ------- ------- ------- 0.00 ------- ------- prdcms 18.85 20.00 ------- 27.01 45.12 47.82 26.80 6.58 12.81 dech 0.00 ------- ------- ------- 0.00 ------- ------- ------- ------- lhcb 0.00 ------- 0.01 0.00 0.01 ------- ------- 0.00 0.00 hone 0.08 ------- 0.06 0.11 0.07 ------- 0.04 0.10 0.11 ops 0.01 ------- 0.00 0.01 0.01 1.16 0.03 0.02 0.03 lhcbplt* 3.16 18.00 3.22 2.09 6.49 ------- 0.11 5.33 1.37 ACCT ------------- QOS ------------- CLASS ------------- atlas 40.05 40.00 25.59 40.08 33.78 ------- 64.81 72.65 52.48 cms* 56.67 40.00 71.11 57.68 59.63 98.84 34.95 21.85 45.97 lhcb* 3.17 20.00 3.23 2.09 6.50 ------- 0.11 5.33 1.37 cscs 0.00 ------- ------- ------- 0.00 ------- ------- ------- ------- lcgadmin 0.03 ------- 0.01 0.03 0.02 ------- 0.06 0.05 0.05 ops 0.01 ------- 0.00 0.01 0.01 1.16 0.03 0.02 0.03 other 0.08 ------- 0.06 0.11 0.07 ------- 0.04 0.10 0.11 </pre> CMS dashboard reports the following production job usage over the last days | *time interval* | *terminated jobs* | *successful jobs* | *Walltime (WrapWC)* | | 2010-11-07 14:00:00 to 2010-11-08 14:00:00 | N.A. | N.A. | N.A. (downtime) | | 2010-11-08 14:00:00 to 2010-11-09 14:00:00 | 602 | 598 | 17603852 | | 2010-11-09 14:00:00 to 2010-11-10 14:00:00 | 1800 | 1785 | 29278990 | | 2010-11-10 14:00:00 to 2010-11-11 14:00:00 | 0 | 0 | 0 | Checking that the fairshare calculation is correct <pre> (0*pow(0.7,0)*14715.8 + 27.01*pow(0.7,1)*24937.3 + 45.12*pow(0.7,2)*14253.4 + 47.82*pow(0.7,3)*70.4 + 26.80*pow(0.7,4)*12515.5 + 6.58*pow(0.7,5)*26208.6 +12.81*pow(0.7,6)*26610.8)/ (pow(0.7,0)*14715.8 + pow(0.7,1)*24937.3 + pow(0.7,2)*14253.4 + pow(0.7,3)*70.4 + pow(0.7,4)*12515.5 + pow(0.7,5)*26208.6 + pow(0.7,6)*26610.8) = 18.853110024838912 </pre> Trying to find out more about the first queued job <pre> Nov 11 14:11 [root@lrms02:~]# checkjob 1992979 job 1992979 AName: STDIN State: Idle Creds: user:prdcms11 group:prdcms class:cms WallTime: 00:00:00 of 3:00:00:00 SubmitTime: Wed Nov 10 10:26:47 (Time Queued Total: 1:03:51:56 Eligible: 1:02:39:19) Total Requested Tasks: 1 Req[0] TaskCount: 1 Partition: ALL Dedicated Resources Per Task: PROCS: 1 MEM: 2000M Flags: RESTARTABLE,FSVIOLATION Attr: FSVIOLATION,checkpoint StartPriority: -1301 NOTE: job req cannot run in partition lrms02 (available procs do not meet requirements : 0 of 1 procs found) idle procs: 397 feasible procs: 0 Node Rejection Summary: [Memory: 84][State: 1][Reserved: 11] Nov 11 14:20 [root@lrms02:~]# showstart 1992979 job 1992979 requires 1 proc for 3:00:00:00 Estimated Rsv based start in 21:41:07 on Fri Nov 12 12:02:31 Estimated Rsv based completion in 3:21:41:07 on Mon Nov 15 12:02:31 Best Partition: lrms02 </pre> -- Main.DerekFeichtinger - 2010-11-11 ---------------- %ICON{arrowleft}% Go to [[CMSSiteLog25][previous page]] / [[CMSSiteLog26][next page]] of CMS site log %M%
Edit
|
Attach
|
Watch
|
P
rint version
|
H
istory
:
r4
<
r3
<
r2
<
r1
|
B
acklinks
|
V
iew topic
|
Raw edit
|
More topic actions...
Topic revision: r2 - 2010-11-11
-
DerekFeichtinger
LCGTier2
Log In
(Topic)
LCGTier2 Web
Create New Topic
Index
Search
Changes
Notifications
Statistics
Preferences
Users
Entry point / Contact
RoadMap
ATLAS Pages
CMS Pages
CMS User Howto
CHIPP CB
Outreach
Technical
Cluster details
Services
Hardware and OS
Tools & Tips
Monitoring
Logs
Maintenances
Meetings
Tests
Issues
Blog
Home
Site map
CmsTier3 web
LCGTier2 web
PhaseC web
Main web
Sandbox web
TWiki web
LCGTier2 Web
Users
Groups
Index
Search
Changes
Notifications
RSS Feed
Statistics
Preferences
P
P
View
Raw View
Print version
Find backlinks
History
More topic actions
Edit
Raw edit
Attach file or image
Edit topic preference settings
Set new parent
More topic actions
Warning: Can't find topic "".""
Account
Log In
Edit
Attach
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback