<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Meeting at ETH to discuss/optimize Piz Daint 2020-01-16

   * *Place*: CLA D17, ETHZ

%TOC%

---++ Slides to guide the discussion

Slides: [[%ATTACHURL%/20200116_ETHmeeting.pdf][20200116_ETHmeeting.pdf]]

---++ Minutes

(also at [[https://docs.google.com/document/d/1Abv4LyD-O5tCGKZy3s2bb47jnx1sxHTuoaT9UtlugSE/edit][https://docs.google.com/document/d/1Abv4LyD-O5tCGKZy3s2bb47jnx1sxHTuoaT9UtlugSE]])

Resource sharing:

*Fixing the ATLAS dips*

What does "ATLAS flat" usage mean? A narrower oscillation of the number of nodes used, at most +/- 20%.

Ideas:
   * Fixed partitions: 40% allocated to ATLAS.
   * Dynamic allocation:
      * High priority to CHIPP for a node (so high that it kills the other jobs).
      * Technically this may be limited by I/O?
      * Memory limited? Only some nodes can be used.
      * Proven feasible with the T0 test.
      * Risk of paying for idle usage → accounting.
      * The experiment will have to tune its load so as not to sit continuously at the cap (e.g. 200 nodes on average with a cap at 250 nodes).
      * Trial and error over about a month to see how to deal with the load tuning.
      * Jobs during the T0 test were starting immediately. Check the draining mechanism of the nodes (there was no 5-day queue at that time).
      * If we exhaust our budget ahead of time, what do we do? If the cap is small this should not be an issue.
   * We can have a mixture of dynamic allocation and fixed allocation.
   * Overlapping partition with a cap on the number of nodes: if the nodes are not used, anybody can use them (even outside CHIPP).
   * "Amazon example": with on-demand/reservation pricing, unused nodes are wasted; to cope with it they increase the price.
   * Node "reservation": need to move from core-hours to node-hours for all VOs. Having only part of it is difficult.

Will the accounting go to the VOs?

Go for node allocation instead of core allocation ("user segregation"): jobs will have to take whole nodes instead of cores (see the sketch below).
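A minimal sketch of the accounting change just mentioned, assuming an illustrative cores-per-node value (Piz Daint node types differ, so the real conversion would be done per node type); the numbers and names here are hypothetical, not from the minutes:

<verbatim>
# Sketch: convert a pledge expressed in core-hours into node-hours once
# accounting moves to whole nodes. CORES_PER_NODE is an assumed example
# value. (On the SLURM side, whole-node jobs can be requested with
# "sbatch --exclusive".)

CORES_PER_NODE = 36  # assumption for illustration only


def core_hours_to_node_hours(core_hours, cores_per_node=CORES_PER_NODE):
    """Map a core-hour budget onto a node-hour budget."""
    return core_hours / cores_per_node


# Example with a hypothetical yearly pledge of 9 million core-hours:
print(f"{core_hours_to_node_hours(9_000_000):.0f} node-hours")  # 250000
</verbatim>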
PLAN:

<img alt="Plan sketch" height="181" src="https://lh3.googleusercontent.com/zy7DiPZ4ajRKg3pccMlqfYbXdaEVy5j6sjM3Jq9vsTNeDXZpRoGv8nxxWOlNLTL9UweOvBaema6OYeP2gLyfm5_QAt6Cc4PbgE-333HYUXAuZpY7DXor1LKDy8dyUQJrFd2jEKKH" width="280" />

WITHIN THE BOX (box = the CHIPP-allocated nodes at CSCS):
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs):
      * Fair share + optimized priority with reservations (see the sketch after this list).
      * When a VO comes back it will get a higher priority until it is back at its target share, then go back to normal.
      * Align the boundaries at the node level (see [1] below).
   * IMPLEMENT ON MON 20 JAN AND TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%
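A toy model of the fair-share rule described above - a returning VO is boosted until it is back at its target share, then returns to normal - and not the actual SLURM configuration; the target shares, weight and usage numbers are made up for illustration:

<verbatim>
# Toy priority model: the boost is proportional to the VO's usage deficit
# with respect to its target share. All numbers are illustrative.

TARGET_SHARE = {"atlas": 40, "cms": 40, "lhcb": 20}  # target shares in %


def vo_priority(vo, recent_usage_pct, base=1000, weight=50):
    """recent_usage_pct: % of the box the VO used in the recent window."""
    deficit = TARGET_SHARE[vo] - recent_usage_pct
    # Below target -> positive deficit -> boosted priority; once the VO is
    # back at its target the priority is back at the base value.
    return base + weight * deficit


print(vo_priority("atlas", recent_usage_pct=10))  # 2500: boosted after a dip
print(vo_priority("atlas", recent_usage_pct=40))  # 1000: back to normal
</verbatim>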
OUTSIDE THE BOX - discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR "DYNAMIC ALLOCATION": forced draining of nodes already in use ("capped").
   * START THE DISCUSSION TO GO FOR "OPPORTUNISTIC" use:
      * use only idle nodes
      * issue: there are very few idle nodes
      * (jobs have to be already in the queue - idle capacity cannot be detected otherwise)
   * START THE DISCUSSION TO GO FOR "OPPORTUNISTIC" use with short jobs ("backfilling"):
      * use only idle nodes
      * (jobs have to be already in the queue - idle capacity cannot be detected otherwise)

*OVERALL UNDERUSAGE*

<img alt="Overall underusage plot" height="189" src="https://lh3.googleusercontent.com/7ZI2CpXD8zdAM__MJ4A6MSyTDB9hjfHLuRMOjvmElMk3dxyCsiTzy0mLqUAIFfsWSuXzP24VtQeP7FMFMe8Czasrho-mINHJ4jYjLSxssBsySrpd5GzvD3Sm0UNjhsObfQDLI4ov" width="602" />

Equalizing the pledges to the capacity (within the box) does not work:
   * CSCS site availability goal is 95%
   * scheduler inefficiency

When other sites show that full capacity is reached, they are using opportunistic resources. The situation has been better in the last month.

CVMFS needs a cache. There is a RAM cache limitation at Piz Daint: strike a compromise between running the cores and keeping them idle to use their RAM for the cache.

CVMFS issue:
   * crash when filling the cache; a workaround was found.

CSCS has a smaller cache than the Bern T2; this can uncover bugs in e.g. CVMFS.

Help reducing cache usage:
   * At the moment we run all VOs on one node, i.e. three software stacks on one node. Go for user segregation: a portion of it (the part not in the shared-resource band) can be done in the reservation test by drawing the boundaries at the node level [1].

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to compile it.

Miscellaneous items:

- ATLAS own job micro-priority (--nice) => top priority now

The SLURM nice parameter: the ATLAS computing model assumes that resources are available. ATLAS manages the internal priority of its jobs; the question is how to present this priority to SLURM.
   * Pilot pull mode: the system assigns the priority.
   * ATLAS push mode: the priority is encoded in the job.

Nice can be switched back on, but it is unclear how to monitor it:
   * Overall, if it is not working, this will show up as a share below 40%.
   * Still, it will not show the internal ranking of priorities.

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco login access to Daint and the use of =sprio=

- VO relative share (latest ticket closed, metrics not settled) → already covered

- ATLAS ~flat delivery (+/- 20% of the due core count) => now seldom a nucleus site → already covered

ARC metrics (monitoring and alarms) - missing since the decommissioning of the Ganglia monitoring that was available to us for a few years:
   * Metric to monitor on ARC: how many jobs are in which (internal) state, to check whether the distribution of states is the expected one (see the sketch below).
   * Ganglia was replaced by "elastic".

→ check whether the monitoring package can be plugged into elastic
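Until something is plugged into elastic, the job-state distribution above could be produced by a small script. The sketch below is an assumption-laden illustration: it presumes the conventional A-REX control directory layout, in which each job has a =job.*.status= file containing its internal state name, and the directory path is a placeholder to be replaced by the site's configured controldir:

<verbatim>
# Hedged sketch: count ARC (A-REX) jobs per internal state by scanning the
# control directory. Assumes a layout where every job has a
# "job.<ID>.status" file whose content is the state name (ACCEPTED,
# PREPARING, SUBMIT, INLRMS, FINISHING, FINISHED, ...). The path below is
# a placeholder.
import collections
import glob

CONTROL_DIR = "/var/spool/arc/jobstatus"  # placeholder; use the site's value

counts = collections.Counter()
for status_file in glob.glob(CONTROL_DIR + "/job.*.status"):
    with open(status_file) as f:
        counts[f.read().strip()] += 1

# Compare the observed distribution of states with the expected one.
for state, n in sorted(counts.items()):
    print("%-12s %5d" % (state, n))
</verbatim>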
ATLAS HammerCloud status (monitoring and alarms):
   * to check the status of the ATLAS/CMS queues (online/blacklisted) at a glance

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard

General:

Timely dCache maintenance and upgrades to avoid disruptive upgrades. Inform the VOs about plans and progress.
   * Keep the upgrades in line with the rest of the community, so that if an issue appears everybody is on it at the same time ("best practice").

Storage accounting implementation (WLCG / EGI):
   * Ask Dario to present the plans for dCache at the next ops meeting.

---++++ People availability

Long delays in replying to operational issues: is there any way to improve/help the situation?
   * Nick: too many reporting avenues (Jira, tickets, slack, calls, etc.); use the CSCS ticket system.
   * If a problem is flagged on slack, who submits the ticket?
      * Do not start the discussion on slack; file a ticket instead.
      * For investigations slack can be tried, but it might not work depending on availability ("best effort basis").
   * Target: 3 hours to address general incidents.

RT tickets are sometimes closed without asking for feedback from the VO representative. Having feedback on the implemented changes can prevent misunderstandings/delays.
   * Long-term issues not fixed with tickets will be added to the action items of the monthly ops meeting agenda.

---++++ Workflows

ATLAS is moving to a federated use of resources (CSCS + Bern) in Switzerland. Storage will transition first, going in the direction of reducing the pressure on the dCache storage (or reducing the size of dCache).

ATLAS full transition timescale: 18 months. Prepare a plan for the transition and follow up in the monthly ops meetings.

---++ Attendants

R. Bernet, N. Cardo, M. Donegà, D. Feichtinger, P. Fernandez, C. Grab, G. Sciacca, M. Weber

---++ Action items (9)

Legend: <number.> title (added: date, done: date) %ICON{new}% / %ICON{done}%

---++++++ 1. Reduce ATLAS dips within the box (added: 16.01.2020, done: ) %ICON{new}%

(Box = the CHIPP-allocated nodes at CSCS)
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs):
      * Fair share + optimized priority with reservations.
      * When a VO comes back it will get a higher priority until it is back at its target share, then go back to normal.
      * Align the boundaries at the node level (see [1] in item 3).
   * IMPLEMENT ON MON 20 JAN AND TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%

---++++++ 2. Reduce ATLAS dips outside the box (added: 16.01.2020, done: ) %ICON{new}%

Discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR "DYNAMIC ALLOCATION": forced draining of nodes already in use ("capped").
   * START THE DISCUSSION TO GO FOR "OPPORTUNISTIC" use:
      * use only idle nodes
      * issue: there are very few idle nodes
      * (jobs have to be already in the queue - idle capacity cannot be detected otherwise)
   * START THE DISCUSSION TO GO FOR "OPPORTUNISTIC" use with short jobs ("backfilling"):
      * use only idle nodes
      * (jobs have to be already in the queue - idle capacity cannot be detected otherwise)

---++++++ 3. Help reducing cache occupancy (added: 16.01.2020, done: ) %ICON{new}%

At the moment we run all VOs on one node, i.e. three software stacks on one node. Go for user segregation: a portion of it (the part not in the shared-resource band) can be done in the reservation test by drawing the boundaries at the node level [1] (see item 1).

---++++++ 4. Site log (added: 16.01.2020, done: ) %ICON{new}%

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to compile it.

---++++++ 5. ATLAS --nice (added: 16.01.2020, done: ) %ICON{new}%

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco login access to Daint and the use of =sprio=

---++++++ 6. ARC metrics (added: 16.01.2020, done: ) %ICON{new}%

→ check whether the monitoring package can be plugged into elastic

---++++++ 7. Queue status from HammerCloud (added: 16.01.2020, done: ) %ICON{new}%

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard

---++++++ 8. dCache updates (added: 16.01.2020, done: ) %ICON{new}%

Ask Dario to present the plans for dCache at the next ops meeting.

---++++++ 9. ATLAS transition to federated resources (added: 16.01.2020, done: ) %ICON{new}%

ATLAS full transition timescale: 18 months. Prepare a plan for the transition and follow up in the monthly ops meetings.