<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Meeting at ETH to discuss/optimize Piz Daint 2020-01-16

   * *Place*: CLA D17, ETHZ

%TOC%

---++ Slides to guide the discussion

Slides: [[%ATTACHURL%/20200116_ETHmeeting.pdf][20200116_ETHmeeting.pdf]]

---++ Minutes

(also in [[https://docs.google.com/document/d/1Abv4LyD-O5tCGKZy3s2bb47jnx1sxHTuoaT9UtlugSE/edit][this Google Doc]])

*Resource sharing: fixing the ATLAS dips*

What does "ATLAS flat" usage mean? A narrower oscillation of the number of nodes used, at most +/- 20%.

Ideas:
   * Fixed partitions: 40% allocated to ATLAS.
   * Dynamic allocation:
      * High priority to CHIPP for a node (so high that it preempts the others).
      * May technically be limited by I/O?
      * Memory limited? Only some nodes can be used.
      * Proven with the T0 test.
      * Risk of paying for idle usage → accounting.
      * The experiment will have to tune the load so as not to run continuously at the cap (e.g. 200 nodes on average with a cap of 250 nodes).
      * Trial and error over about a month to see how to deal with the load tuning.
      * Jobs during the T0 test were starting immediately. Check the draining mechanism of the nodes (there was no 5-day queue at that time).
      * If we exhaust our budget ahead of time, what do we do? If the cap is small this should not be an issue.
   * We can have a mixture of dynamic and fixed allocation.
   * Overlapping partition with a cap on the number of nodes: if the nodes are not used, anybody can use them (even outside CHIPP).
   * "Amazon example": with on-demand/reservation pricing, unused nodes are wasted; to cope with it they increase the price.
   * Node "reservation": need to move from core-hours → node-hours for all VOs; having only part of it is difficult.

Accounting going to the VOs?

Go for node allocation instead of core allocation ("user segregation"): jobs will have to take whole nodes instead of cores (see the sketch below).
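A minimal sketch of what whole-node allocation could look like on the SLURM side; this is an illustration, not the actual Piz Daint configuration, and the partition and job script names are placeholders:

<verbatim>
# Submit side: request whole nodes per job
# ("chipp" partition and pilot_job.sh are hypothetical names)
sbatch --exclusive --nodes=1 --partition=chipp pilot_job.sh

# Cluster side (slurm.conf): allocate whole nodes to every job instead of
# handing out individual cores as consumable resources:
#   SelectType=select/linear
</verbatim>

With whole-node allocation, accounting naturally moves from core-hours to node-hours, which is the change discussed above.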
PLAN:

<img alt="Plan sketch" height="181" src="https://lh3.googleusercontent.com/zy7DiPZ4ajRKg3pccMlqfYbXdaEVy5j6sjM3Jq9vsTNeDXZpRoGv8nxxWOlNLTL9UweOvBaema6OYeP2gLyfm5_QAt6Cc4PbgE-333HYUXAuZpY7DXor1LKDy8dyUQJrFd2jEKKH" width="280" />

WITHIN THE BOX (box = CHIPP-allocated nodes at CSCS):
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs).
      * Fair share + optimized priority with reservations.
      * When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal.
      * Align the boundaries at the node level (see [1] below).
   * IMPLEMENT ON MON 20 JAN; TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%

OUTSIDE THE BOX - discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR THE "DYNAMIC ALLOCATION":
      * forced draining of nodes already in use, "capped".
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach:
      * use only idle nodes;
      * issue: there are very few idle nodes;
      * jobs have to be already in the queue - idle capacity cannot be detected otherwise.
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach with short jobs ("backfilling"):
      * use only idle nodes;
      * jobs have to be already in the queue - it cannot be detected otherwise.

*OVERALL UNDERUSAGE*

<img alt="Overall underusage plot" height="189" src="https://lh3.googleusercontent.com/7ZI2CpXD8zdAM__MJ4A6MSyTDB9hjfHLuRMOjvmElMk3dxyCsiTzy0mLqUAIFfsWSuXzP24VtQeP7FMFMe8Czasrho-mINHJ4jYjLSxssBsySrpd5GzvD3Sm0UNjhsObfQDLI4ov" width="602" />

Equalizing pledges to capacity (within the box) does not work:
   * CSCS site availability goal is 95%;
   * scheduler inefficiency.

When other sites show that full capacity is reached, they are using opportunistic resources. The situation was better in the last month.

CVMFS needs a cache, and there is a RAM cache limitation on Piz Daint: strike a compromise between running cores and keeping them idle to use their RAM (see the configuration sketch after the lists below).

CVMFS issue:
   * Crash when filling the cache; a workaround was found.
   * CSCS has a smaller cache than the Bern T2, which can uncover bugs, e.g. in CVMFS.

Help reducing cache usage:
   * At the moment we run all VOs on one node, i.e. 3 software stacks on one node. Go for user segregation: a portion of it (the part not in the shared resource band) can be done in the reservation test by drawing the boundaries at the node level [1].
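A minimal sketch of the client-side knobs involved in this compromise, assuming the standard CVMFS configuration file; the values are illustrative, not the actual Piz Daint settings:

<verbatim>
# /etc/cvmfs/default.local -- illustrative values, not the production config
CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,lhcb.cern.ch
CVMFS_CACHE_BASE=/tmp/cvmfs   # assumption: a RAM-backed location on the compute nodes
CVMFS_QUOTA_LIMIT=8000        # cache limit in MB: larger eases the cache-filling bug,
                              # smaller leaves more RAM for job payloads

# Inspect per-repository cache usage after a change:
#   cvmfs_config stat -v atlas.cern.ch
</verbatim>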
IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to fill it in.

*Miscellaneous items*

- ATLAS own job micro-priority (--nice) => top priority now.

SLURM nice parameter: the ATLAS computing model assumes that resources are available. ATLAS manages the internal priority of its jobs; the question is how to export this to SLURM as currently set up.
   * Pilot pull mode: the system assigns the priority.
   * ATLAS push mode: the priority is encoded in the job.

nice can be switched back on, but it is unclear how to monitor it:
   * Overall, if it is not working, it will be reflected in a share below 40%.
   * Still, it will not show the internal ranking of priorities.

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1].

→ Give Gianfranco access to log in on Daint and use sprio (see the sketch below).
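A minimal sketch of the test, assuming plain SLURM tooling; the job script name is a placeholder:

<verbatim>
# ATLAS side: lower the relative priority of a less urgent job.
# A positive --nice value lowers priority; negative values need privileges,
# so the experiment can only deprioritize its own jobs this way.
sbatch --nice=100 low_priority_job.sh   # low_priority_job.sh is hypothetical

# Daint login node: inspect the resulting multifactor priorities of pending jobs
sprio -l
</verbatim>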
- VO relative share (latest ticket closed, metrics not settled)

→ already covered

- ATLAS ~flat delivery (+/- 20% from the due core count) => now seldom a nucleus site

→ already covered

ARC metrics (monitoring and alarms), needed since the dismissal of the Ganglia monitoring that was available to us for a few years:
   * Metric to monitor on ARC how many jobs are in which (internal) state, and see whether you get the distribution of states you expect.
   * Ganglia was replaced by "elastic".

→ check whether the monitoring package can be plugged into elastic

ATLAS HammerCloud status (monitoring and alarms):
   * To check the status of the ATLAS/CMS queues (online/blacklisted) at a glance.

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard

General:

Timely dCache maintenance and upgrades to avoid disruptive upgrades; inform the VOs of plans and progress.
   * Keep the upgrade in line with the rest of the community, such that if an issue appears everybody is on it at the same time ("best practice").

Storage accounting implementation (WLCG / EGI):
   * Ask Dario to present plans for dCache at the next ops meeting.

---++++ People availability

Long delays in replying to operational issues: is there any way to improve/help the situation?
   * Nick: too many reporting avenues (Jira tickets, Slack, calls, etc.); use the CSCS ticket system.
   * If a problem is flagged on Slack, who submits the ticket?
      * Do not start the discussion on Slack; file a ticket.
      * For investigation try Slack, but it might not work depending on availability ("best effort" basis).
   * Target: 3 hours to address general incidents.

RT tickets are sometimes closed without asking for feedback from the VO representative. Having feedback on the implemented changes can prevent misunderstandings and delays.
   * Long-term issues not fixed with tickets will be added to the action items of the monthly ops meeting agenda.

---++++ Workflows

ATLAS is moving to a federated use of resources (CSCS + Bern) in Switzerland. Storage will transition first, going in the direction of reducing the pressure on the dCache storage (or reducing the size of dCache).

ATLAS full transition timescale: 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.

---++ Attendants

R. Bernet, N. Cardo, M. Donegà, D. Feichtinger, P. Fernandez, C. Grab, G. Sciacca, M. Weber

---++ Action items (9)

Legend: <number.> title (added: date, done: date) %ICON{new}% / %ICON{done}%

---++++++ 1. Reduce ATLAS dips within the box (added: 16.01.2020, done: ) %ICON{new}%

(Box = CHIPP-allocated nodes at CSCS)
   * IMPLEMENT THE "INTERNAL DYNAMIC ALLOCATION" (easy to implement, but the idle cost goes back to the VOs).
      * Fair share + optimized priority with reservations.
      * When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal.
      * Align the boundaries at the node level (see [1] below in item 3).
   * IMPLEMENT ON MON 20 JAN; TEST UNTIL 29 JAN (MAINTENANCE).
   * PRELIMINARY NUMBERS to size the shared resources:
      * CMS 50%
      * ATLAS 50%
      * LHCb 50%

---++++++ 2. Reduce ATLAS dips outside the box (added: 16.01.2020, done: ) %ICON{new}%

Discussion to be started with M. DeLorenzi and the CSCS CTO:
   * START THE DISCUSSION TO GO FOR THE "DYNAMIC ALLOCATION":
      * forced draining of nodes already in use, "capped".
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach:
      * use only idle nodes;
      * issue: there are very few idle nodes;
      * jobs have to be already in the queue - idle capacity cannot be detected otherwise.
   * START THE DISCUSSION TO GO FOR THE "OPPORTUNISTIC" approach with short jobs ("backfilling"):
      * use only idle nodes;
      * jobs have to be already in the queue - it cannot be detected otherwise.

---++++++ 3. Help reducing cache occupancy (added: 16.01.2020, done: ) %ICON{new}%

At the moment we run all VOs on one node, i.e. 3 software stacks on one node. Go for user segregation: a portion of it (the part not in the shared resource band) can be done in the reservation test by drawing the boundaries at the node level [1] (see item 1).

---++++++ 4. Site log (added: 16.01.2020, done: ) %ICON{new}%

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and the experiments are to fill it in.

---++++++ 5. ATLAS --nice (added: 16.01.2020, done: ) %ICON{new}%

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1].

→ Give Gianfranco access to log in on Daint and use sprio.

---++++++ 6. ARC metrics (added: 16.01.2020, done: ) %ICON{new}%

→ check whether the monitoring package can be plugged into elastic

---++++++ 7. Queue status HammerClouds (added: 16.01.2020, done: ) %ICON{new}%

→ input from the VO reps (provide the API call) to CSCS, then put it on the dashboard (a sketch follows the action items below)

---++++++ 8. dCache updates (added: 16.01.2020, done: ) %ICON{new}%

Ask Dario to present plans for dCache at the next ops meeting.

---++++++ 9. ATLAS transition to federated resources (added: 16.01.2020, done: ) %ICON{new}%

ATLAS full transition timescale: 18 months. Prepare a plan for the transition; follow up in monthly ops meetings.
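For item 7, a minimal sketch of the kind of poller that could feed the dashboard once the VO reps provide the actual API call. The endpoint URL, queue names, and JSON fields below are placeholders, not a real HammerCloud API:

<verbatim>
#!/usr/bin/env python3
"""Hedged sketch for action item 7: poll a queue-status API and flag
queues that are not online. QUEUE_STATUS_URL, QUEUES and the response
fields are placeholders until the VO reps provide the real call."""
import json
import urllib.request

QUEUE_STATUS_URL = "https://example.org/api/queue-status"  # placeholder URL
QUEUES = ["CSCS-LCG2_queue_a", "CSCS-LCG2_queue_b"]        # placeholder queue names


def fetch_status(url: str) -> dict:
    # Assumed response shape: {"<queue>": {"state": "online" | "blacklisted"}}
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def main() -> None:
    status = fetch_status(QUEUE_STATUS_URL)
    for queue in QUEUES:
        state = status.get(queue, {}).get("state", "unknown")
        flag = "OK" if state == "online" else "ALERT"
        print(f"[{flag}] {queue}: {state}")


if __name__ == "__main__":
    main()
</verbatim>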
---++ Topic attachments

| *Attachment* | *History* | *Action* | *Size* | *Date* | *Who* | *Comment* |
| 20200116_ETHmeeting.pdf | r1 | manage | 103.9 K | 2020-01-17 - 13:46 | MauroDonega | |