
Meeting at ETH to discuss/optimize Piz Daint, 2020-01-16

  • Place: CLA D17 ETHZ

Slides to guide the discussion

Slides: 20200116_ETHmeeting.pdf

Minutes

(also at https://docs.google.com/document/d/1Abv4LyD-O5tCGKZy3s2bb47jnx1sxHTuoaT9UtlugSE )

Resource sharing:

Fixing the ATLAS dips

What does “flat ATLAS usage” mean? Narrower oscillation of the number of nodes used, at most +/- 20%
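The ±20% flatness criterion can be sketched as a simple band check (the node counts and the 200-node target level below are illustrative, not measured figures):

```python
def is_flat(node_counts, target, band=0.20):
    """True if every sample stays within +/- band of the target node count."""
    lo, hi = target * (1 - band), target * (1 + band)
    return all(lo <= n <= hi for n in node_counts)

# Flat: oscillates within the 160..240 band around a 200-node target
print(is_flat([190, 210, 205, 185], 200))  # True
# Not flat: a dip to 40 nodes breaks the band
print(is_flat([190, 40, 210, 200], 200))   # False
```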

Ideas:

  • Fixed partitions: 40% allocated to ATLAS

  • Dynamic allocation:

    • High priority to CHIPP for a node (so high you kill the others)

    • Technically it may be limited by I/O?

    • Memory limited? Only some nodes can be used

    • Proven with the T0 test

    • Risk to pay for idle usage → Accounting

    • Experiments will have to tune the load not to continuously run at the max (e.g. 200 nodes on average with a max of 250 nodes)

    • Trial and error over ~a month to see how to deal with the load tuning

    • Jobs during the T0 test were starting immediately. Check the draining mechanism of the nodes (there was no 5-day queue at that time)

    • If we exhaust our budget ahead of time, what do we do? If the cap is small it should not be an issue

    • We can have a mixture of dynamic allocation + fixed one

    • Overlapping partition with a cap on the number of nodes. If nodes are not used, anybody (even outside CHIPP) can use them

    • “Amazon example”: on-demand vs. reservation; unused reserved nodes are wasted, and to cope with that the price is increased

  • Node “reservation”: need to move from core-hours → node-hours for all VOs. Having only part of them on node-hours is difficult

Does the accounting go to the VOs?

Go for node allocation instead of core allocation (“user segregation”)

Jobs will have to take whole nodes instead of cores
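The core-hours → node-hours switch is essentially a unit conversion once jobs take whole nodes. A minimal sketch (the 36 cores per node figure is an assumption for illustration, not a quoted Piz Daint number):

```python
def core_to_node_hours(core_hours, cores_per_node):
    """Convert a core-hours budget to node-hours, assuming whole-node allocation."""
    return core_hours / cores_per_node

# Illustrative: a 72000 core-hour pledge on assumed 36-core nodes
print(core_to_node_hours(72000, 36))  # 2000.0
```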

PLAN:

WITHIN THE BOX (Box = CHIPP allocated nodes at CSCS)

  • IMPLEMENT THE “INTERNAL DYNAMIC ALLOCATION” (easy to implement - but the idle cost goes back to the VOs)

    • Fair share + optimized priority with reservations

    • When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal

    • Align the boundaries at the node level (see [1] below)

    • IMPLEMENT ON MON 20 JAN, TEST UNTIL 29 JAN (MAINTENANCE).

    • PRELIMINARY NUMBERS to size the shared resources:

      • CMS 50%

      • ATLAS 50%

      • LHCb 50%
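The “higher priority until back at target” behaviour above can be sketched as a boost proportional to how far a VO is below its target share (the weight and percentages are illustrative, not the actual SLURM multifactor configuration):

```python
def priority_boost(target_share, actual_usage, total_usage, weight=1000):
    """Boost priority when a VO's recent usage is below its target share."""
    if total_usage == 0:
        return weight  # nothing running: full boost
    actual_share = actual_usage / total_usage
    deficit = max(0.0, target_share - actual_share)
    return round(weight * deficit / target_share)

# A VO at 10% usage against a 40% target gets a large boost...
print(priority_boost(0.40, 10, 100))  # 750
# ...and no boost once it is back at (or above) target
print(priority_boost(0.40, 45, 100))  # 0
```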

OUTSIDE THE BOX - Discussion to be started with M. DeLorenzi and CSCS CTO

  • START THE DISCUSSION TO GO FOR THE “DYNAMIC ALLOCATION”

- forced draining of nodes already in use “capped”

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- issue: there are very few idle nodes

- (jobs have to be already in the queue - it cannot be detected otherwise)

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- with short jobs “backfilling”

- (jobs have to be already in the queue - it cannot be detected otherwise)




OVERALL UNDERUSAGE



Equalize pledges to capacity (within the box). This does not work because:

  • CSCS site availability goal 95%

  • Scheduler inefficiency

When other sites show that full capacity is reached, they are using opportunistic resources

Better situation in the last month

CVMFS needs a cache: RAM-cache limitation at Piz Daint - strike a compromise between running all cores or keeping some idle to use their RAM for the cache:

CVMFS issue:

  • Crash when filling the cache

Found a workaround.

CSCS has a smaller cache than the Bern T2; this can uncover bugs in e.g. CVMFS

Help reducing cache usage:

  • At the moment we run all VOs on one node, i.e. 3 software stacks on one node - go for user segregation: a portion of it (the part not in the shared-resource band) can be done on the reservation test by drawing the boundaries at the node level [1]

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and Experiment to compile it




Miscellaneous items

- ATLAS' own job micro-priority (--nice) => top priority now

SLURM nice parameter - the ATLAS computing model assumes that resources are available. ATLAS is managing the internal priority of their jobs. The question is how to export this to the present SLURM setup.

Pilot/pull mode: the system assigns the priority

ATLAS push mode: the priority is encoded in the job

Nice can be switched back on, but unclear how to monitor it:

  • Overall, if it is not working it will be reflected in the <40% share

  • Still it will not be showing the internal ranking of priorities

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco access to login on Daint and use sprio
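How a SLURM-style --nice offset demotes jobs within the same share can be sketched as follows (simplified: the real SLURM priority also combines fair share, age, QOS, etc.; the job names and numbers are made up):

```python
def effective_priority(base_priority, nice=0):
    """SLURM subtracts the job's nice value from its computed priority."""
    return base_priority - nice

# Two jobs from the same VO: the niced MC job yields to the analysis job
jobs = [("analysis", 5000, 0), ("low-prio MC", 5000, 100)]
ordered = sorted(jobs, key=lambda j: effective_priority(j[1], j[2]), reverse=True)
print([name for name, *_ in ordered])  # ['analysis', 'low-prio MC']
```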

- VO relative share (latest ticket closed, metrics not settled)

→ already covered



- ATLAS ~flat delivery (+/- 20% from due core count) => now seldom a nucleus site

→ already covered

ARC metrics (monitoring and alarms) - since the dismissal of the Ganglia monitoring, which had been available to us for a few years

  • Metric to monitor on ARC how many jobs are in which (internal) state and see whether you get the distribution of states you expect.

  • Ganglia was replaced by “elastic”

→ check if possible to plug the monitoring package in elastic
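The proposed ARC metric - how many jobs are in which internal state - amounts to a state histogram; a sketch with a hypothetical snapshot (in production the states would be read from the A-REX control directory or its tooling, whose interface is not quoted here):

```python
from collections import Counter

def state_distribution(job_states):
    """Count jobs per (internal) ARC state, to compare against the expected mix."""
    return Counter(job_states)

# Hypothetical snapshot of ARC internal job states
snapshot = ["INLRMS:R", "INLRMS:R", "INLRMS:Q", "PREPARING", "FINISHING"]
dist = state_distribution(snapshot)
print(dist["INLRMS:R"])  # 2
```

An alarm would then fire when a state (e.g. PREPARING) piles up beyond its usual fraction.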

ATLAS HammerCloud status (monitoring and alarms)

  • To check status of ATLAS/CMS queues (online/blacklist) at a glance

→ input from VOreps (provide the API call) to CSCS and then put on the dashboard

General:

Timely dCache maintenance and upgrades to avoid disruptive upgrades. Inform VOs of plans and progress

  • Keep the upgrade in line with the rest of the community such that if an issue appears everybody is on it at the same time

“Best practice”

Storage accounting implementation (WLCG / EGI )

  • Ask Dario to present plans for dCache at the next ops meeting

People availability

Long delays in replying to operational issues: is there any way to improve/help the situation?

- Nick: too many reporting avenues (Jira, tickets, Slack, calls, etc.). Use the CSCS ticket system

- if a problem is flagged on Slack, who is submitting the ticket?

  • Do not start the discussion on slack but file a ticket

  • For investigations try Slack; it might not work depending on availability (“best-effort basis”)

  • Target 3 hours to address general incidents

RT tickets are sometimes closed without asking for feedback from the VO representative. Having feedback on the implemented changes can prevent misunderstandings/delays

  • Long term issues not fixed with tickets will be added to the action items of the monthly ops meeting agenda

Workflows

ATLAS is moving to a federated use of resources (CSCS + Bern) in Switzerland. Storage will transition first, going in the direction of reducing the pressure on the dCache storage (or reducing the size of dCache).

ATLAS full transition timescale: 18 months. Prepare a plan for the transition, follow up in monthly ops meetings.

Attendees

R. Bernet, N. Cardo, M. Donegà, D. Feichtinger, P. Fernandez, C. Grab, G. Sciacca, M. Weber

Action items (9)

Legend: <number.> title (added: date, done:date) NEW / DONE

1. Reduce ATLAS dips within the box: (added: 16.01.2020, done:)NEW

(Box = CHIPP allocated nodes at CSCS)

  • IMPLEMENT THE “INTERNAL DYNAMIC ALLOCATION” (easy to implement - but the idle cost goes back to the VOs)

    • Fair share + optimized priority with reservations

    • When a VO comes back it will take a higher priority until it gets back to its target, then go back to normal

    • Align the boundaries at the node level (see [1] below in item 3.)

    • IMPLEMENT ON MON 20 JAN, TEST UNTIL 29 JAN (MAINTENANCE).

    • PRELIMINARY NUMBERS to size the shared resources:

      • CMS 50%

      • ATLAS 50%

      • LHCb 50%

2. Reduce ATLAS dips outside the box: (added: 16.01.2020, done:)NEW

- Discussion to be started with M. DeLorenzi and CSCS CTO

  • START THE DISCUSSION TO GO FOR THE “DYNAMIC ALLOCATION”

- forced draining of nodes already in use “capped”

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- issue: there are very few idle nodes

- (jobs have to be already in the queue - it cannot be detected otherwise)

  • START THE DISCUSSION TO GO FOR THE “opportunistic”

- use only idle nodes

- with short jobs “backfilling”

- (jobs have to be already in the queue - it cannot be detected otherwise)

3. Help reducing Cache occupancy (added: 16.01.2020, done:)NEW

At the moment we run all VOs on one node, i.e. 3 software stacks on one node

- go for user segregation: a portion of it (the part not in the shared-resource band) can be done on the reservation test by drawing the boundaries at the node level [1] (see item 1)

4. Site Log (added: 16.01.2020, done:)NEW

IMPLEMENT A SAFE SITE LOG FOR CHIPP RESOURCES. Both CSCS and Experiment to compile it

5. ATLAS --nice (added: 16.01.2020, done:)NEW

TRY TO SET IT TO A LOW VALUE AND TEST → SCHEDULED AFTER THE TEST OF [1]

→ Give Gianfranco access to login on Daint and use sprio

6. ARC metrics (added: 16.01.2020, done:)NEW

→ check if possible to plug the monitoring package in elastic

→ Extract the ARC job states from ARC and report them on dashboard.lcg.cscs.ch.

7. Queue status hammerclouds (added: 16.01.2020, done:)NEW

→ input from VOreps (provide the API call) to CSCS and then put on the dashboard

8. dCache updates (added: 16.01.2020, done:)NEW

Ask Dario to present plans for dCache at the next ops meeting

9. ATLAS transition to federated resources (added: 16.01.2020, done:) NEW

ATLAS full transition timescale: 18 months. Prepare a plan for the transition, follow up in monthly ops meetings.

10. Access to ARC logs (added: 09.04.2020, done:) NEW

Investigate possibilities for making the ARC logs externally accessible.

Attachment: 20200116_ETHmeeting.pdf (r1, 103.9 K, 2020-01-17, MauroDonega)