<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Meeting at ETH to discuss/optimize Piz Daint 2020-01-16

   * *Place*: CLA D17, ETHZ

%TOC%

---++ Slides to guide the discussion

Slides: [[%ATTACHURL%/20200116_ETHmeeting.pdf][20200116_ETHmeeting.pdf]]

---++ Minutes

(also in [[https://docs.google.com/document/d/1Abv4LyD-O5tCGKZy3s2bb47jnx1sxHTuoaT9UtlugSE/edit][this Google Doc]])

*Resource sharing: fixing the ATLAS dips*

What does "ATLAS flat" usage mean? A narrower oscillation of the number of nodes used, at most +/- 20%.

Ideas:
   * Fixed partitions: 40% allocated to ATLAS.
   * Dynamic allocation:
      * High priority to CHIPP for a node (so high that it preempts the others).
      * May technically be limited by I/O?
      * Memory limited? Only some nodes can be used.
      * Proven with the T0 test.
      * Risk of paying for idle usage → accounting.
      * The experiment will have to tune the load so as not to run continuously at the cap (e.g. 200 nodes on average with a cap of 250 nodes).
      * Trial and error over about a month to see how to deal with the load tuning.
      * Jobs during the T0 test were starting immediately. Check the draining mechanism of the nodes (there was no 5-day queue at that time).
      * If we exhaust our budget ahead of time, what do we do? If the cap is small this should not be an issue.
   * We can have a mixture of dynamic and fixed allocation.
   * Overlapping partition with a cap on the number of nodes: if the nodes are not used, anybody can use them (even outside CHIPP).
   * "Amazon example": with on-demand/reservation pricing, unused nodes are wasted; to cope with it they increase the price.
   * Node "reservation": need to move from core-hours → node-hours for all VOs; having only part of it is difficult.

Accounting going to the VOs?

Go for node allocation instead of core allocation ("user segregation"): jobs will have to take whole nodes instead of cores (see the sketch below).
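A minimal sketch of what whole-node allocation could look like on the SLURM side; this is an illustration, not the actual Piz Daint configuration, and the partition and job script names are placeholders:

<verbatim>
# Submit side: request whole nodes per job
# ("chipp" partition and pilot_job.sh are hypothetical names)
sbatch --exclusive --nodes=1 --partition=chipp pilot_job.sh

# Cluster side (slurm.conf): allocate whole nodes to every job instead of
# handing out individual cores as consumable resources:
#   SelectType=select/linear
</verbatim>

With whole-node allocation, accounting naturally moves from core-hours to node-hours, which is the change discussed above.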
PLAN:

<img alt="Plan sketch" height="181" src="https://lh3.googleusercontent.com/zy7DiPZ4ajRKg3pccMlqfYbXdaEVy5j6sjM3Jq9vsTNeDXZpRoGv8nxxWOlNLTL9UweOvBaema6OYeP2gLyfm5_QAt6Cc4PbgE-333HYUXAuZpY7DXor1LKDy8dyUQJrFd2jEKKH" width="280" />

WITHIN THE BOX (box = CHIPP-allocated nodes at CSCS):
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs).
      * Fair share + optimized priority with reservations.
      * When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal.
      * Align the boundaries at the node level (see [1] below).
   * IMPLEMENT ON MON 20 JAN; TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%

OUTSIDE THE BOX - discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR THE "DYNAMIC ALLOCATION":
      * forced draining of nodes already in use, "capped".
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach:
      * use only idle nodes;
      * issue: there are very few idle nodes;
      * jobs have to be already in the queue - idle capacity cannot be detected otherwise.
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach with short jobs ("backfilling"):
      * use only idle nodes;
      * jobs have to be already in the queue - it cannot be detected otherwise.

*OVERALL UNDERUSAGE*

<img alt="Overall underusage plot" height="189" src="https://lh3.googleusercontent.com/7ZI2CpXD8zdAM__MJ4A6MSyTDB9hjfHLuRMOjvmElMk3dxyCsiTzy0mLqUAIFfsWSuXzP24VtQeP7FMFMe8Czasrho-mINHJ4jYjLSxssBsySrpd5GzvD3Sm0UNjhsObfQDLI4ov" width="602" />

Equalizing pledges to capacity (within the box) does not work:
   * CSCS site availability goal is 95%;
   * scheduler inefficiency.

When other sites show that full capacity is reached, they are using opportunistic resources. The situation was better in the last month.

CVMFS needs a cache, and there is a RAM cache limitation on Piz Daint: strike a compromise between running cores and keeping them idle to use their RAM (see the configuration sketch after the lists below).

CVMFS issue:
   * Crash when filling the cache; a workaround was found.
   * CSCS has a smaller cache than the Bern T2, which can uncover bugs, e.g. in CVMFS.

Help reducing cache usage:
   * At the moment we run all VOs on one node, i.e. 3 software stacks on one node. Go for user segregation: a portion of it (the part not in the shared resource band) can be done in the reservation test by drawing the boundaries at the node level [1].
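A minimal sketch of the client-side knobs involved in this compromise, assuming the standard CVMFS configuration file; the values are illustrative, not the actual Piz Daint settings:

<verbatim>
# /etc/cvmfs/default.local -- illustrative values, not the production config
CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,lhcb.cern.ch
CVMFS_CACHE_BASE=/tmp/cvmfs   # assumption: a RAM-backed location on the compute nodes
CVMFS_QUOTA_LIMIT=8000        # cache limit in MB: larger eases the cache-filling bug,
                              # smaller leaves more RAM for job payloads

# Inspect per-repository cache usage after a change:
#   cvmfs_config stat -v atlas.cern.ch
</verbatim>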
IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to fill it in.

*Miscellaneous items*

- ATLAS own job micro-priority (--nice) => top priority now.

SLURM nice parameter: the ATLAS computing model assumes that resources are available. ATLAS manages the internal priority of its jobs; the question is how to export this to SLURM as currently set up.
   * Pilot pull mode: the system assigns the priority.
   * ATLAS push mode: the priority is encoded in the job.

nice can be switched back on, but it is unclear how to monitor it:
   * Overall, if it is not working, it will be reflected in a share below 40%.
   * Still, it will not show the internal ranking of priorities.

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1].

→ Give Gianfranco access to log in on Daint and use sprio (see the sketch below).
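A minimal sketch of the test, assuming plain SLURM tooling; the job script name is a placeholder:

<verbatim>
# ATLAS side: lower the relative priority of a less urgent job.
# A positive --nice value lowers priority; negative values need privileges,
# so the experiment can only deprioritize its own jobs this way.
sbatch --nice=100 low_priority_job.sh   # low_priority_job.sh is hypothetical

# Daint login node: inspect the resulting multifactor priorities of pending jobs
sprio -l
</verbatim>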
- VO relative share (latest ticket closed, metrics not settled)

→ already covered

- ATLAS ~flat delivery (+/- 20% from the due core count) => now seldom a nucleus site

→ already covered

ARC metrics (monitoring and alarms), needed since the dismissal of the Ganglia monitoring that was available to us for a few years:
   * Metric to monitor on ARC how many jobs are in which (internal) state, and see whether you get the distribution of states you expect.
   * Ganglia was replaced by "elastic".

→ check whether the monitoring package can be plugged into elastic

ATLAS HammerCloud status (monitoring and alarms):
   * To check the status of the ATLAS/CMS queues (online/blacklisted) at a glance.

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard

General:

Timely dCache maintenance and upgrades to avoid disruptive upgrades; inform the VOs of plans and progress.
   * Keep the upgrade in line with the rest of the community, such that if an issue appears everybody is on it at the same time ("best practice").

Storage accounting implementation (WLCG / EGI):
   * Ask Dario to present plans for dCache at the next ops meeting.

---++++ People availability

Long delays in replying to operational issues: is there any way to improve/help the situation?
   * Nick: too many reporting avenues (Jira tickets, Slack, calls, etc.); use the CSCS ticket system.
   * If a problem is flagged on Slack, who submits the ticket?
      * Do not start the discussion on Slack; file a ticket.
      * For investigation try Slack, but it might not work depending on availability ("best effort" basis).
   * Target: 3 hours to address general incidents.

RT tickets are sometimes closed without asking for feedback from the VO representative. Having feedback on the implemented changes can prevent misunderstandings and delays.
   * Long-term issues not fixed with tickets will be added to the action items of the monthly ops meeting agenda.

---++++ Workflows

ATLAS is moving to a federated use of resources (CSCS + Bern) in Switzerland. Storage will transition first, going in the direction of reducing the pressure on the dCache storage (or reducing the size of dCache).

ATLAS full transition timescale: 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.

---++ Attendants

R. Bernet, N. Cardo, M. Donegà, D. Feichtinger, P. Fernandez, C. Grab, G. Sciacca, M. Weber

---++ Action items (9)

Legend: <number.> title (added: date, done: date) %ICON{new}% / %ICON{done}%

---++++++ 1. Reduce ATLAS dips within the box (added: 16.01.2020, done: ) %ICON{new}%

(Box = CHIPP-allocated nodes at CSCS)
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs).
      * Fair share + optimized priority with reservations.
      * When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal.
      * Align the boundaries at the node level (see [1] below in item 3).
   * IMPLEMENT ON MON 20 JAN; TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%

---++++++ 2. Reduce ATLAS dips outside the box (added: 16.01.2020, done: ) %ICON{new}%

Discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR THE "DYNAMIC ALLOCATION":
      * forced draining of nodes already in use, "capped".
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach:
      * use only idle nodes;
      * issue: there are very few idle nodes;
      * jobs have to be already in the queue - idle capacity cannot be detected otherwise.
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach with short jobs ("backfilling"):
      * use only idle nodes;
      * jobs have to be already in the queue - it cannot be detected otherwise.

---++++++ 3. Help reducing cache occupancy (added: 16.01.2020, done: ) %ICON{new}%

At the moment we run all VOs on one node, i.e. 3 software stacks on one node. Go for user segregation: a portion of it (the part not in the shared resource band) can be done in the reservation test by drawing the boundaries at the node level [1] (see item 1).

---++++++ 4. Site log (added: 16.01.2020, done: ) %ICON{new}%

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to fill it in.

---++++++ 5. ATLAS --nice (added: 16.01.2020, done: ) %ICON{new}%

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1].

→ Give Gianfranco access to log in on Daint and use sprio.

---++++++ 6. ARC metrics (added: 16.01.2020, done: ) %ICON{new}%

→ check whether the monitoring package can be plugged into elastic

---++++++ 7. Queue status HammerClouds (added: 16.01.2020, done: ) %ICON{new}%

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard (a sketch follows the action items below)

---++++++ 8. dCache updates (added: 16.01.2020, done: ) %ICON{new}%

Ask Dario to present plans for dCache at the next ops meeting.

---++++++ 9. ATLAS transition to federated resources (added: 16.01.2020, done: ) %ICON{new}%

ATLAS full transition timescale: 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.
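For item 7, a minimal sketch of the kind of poller that could feed the dashboard once the VO reps provide the actual API call. The endpoint URL, queue names, and JSON fields below are placeholders, not a real HammerCloud API:

<verbatim>
#!/usr/bin/env python3
"""Hedged sketch for action item 7: poll a queue-status API and flag
queues that are not online. QUEUE_STATUS_URL, QUEUES and the response
fields are placeholders until the VO reps provide the real call."""
import json
import urllib.request

QUEUE_STATUS_URL = "https://example.org/api/queue-status"  # placeholder URL
QUEUES = ["CSCS-LCG2_queue_a", "CSCS-LCG2_queue_b"]        # placeholder queue names


def fetch_status(url: str) -> dict:
    # Assumed response shape: {"<queue>": {"state": "online" | "blacklisted"}}
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def main() -> None:
    status = fetch_status(QUEUE_STATUS_URL)
    for queue in QUEUES:
        state = status.get(queue, {}).get("state", "unknown")
        flag = "OK" if state == "online" else "ALERT"
        print(f"[{flag}] {queue}: {state}")


if __name__ == "__main__":
    main()
</verbatim>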
---++ Topic attachments

| *Attachment* | *History* | *Action* | *Size* | *Date* | *Who* | *Comment* |
| 20200116_ETHmeeting.pdf | r1 | manage | 103.9 K | 2020-01-17 - 13:46 | MauroDonega | |