Share issue first flagged on Piz Daint during the LHConCray commissioning project (1 year ago)
Not enough job pressure from CMS
Relative shares between ATLAS and LHCb skewed in favour of LHCb
Raised again at the f2f meeting on 21st June in ZH
In September we realised that the issue had been showing up on Phoenix too since ~May 2018
Hard to keep track of, since monitoring dashboards cannot be accessed
Did some investigations with Dino, and discussed further f2f with Pablo & Dino
What is the fair-share problem?
ATLAS MultiCore jobs wait too long in the queue compared to single core jobs
ATLAS: ~80% MC, ~20% SC
1 job=1 payload
internal fair-share done at the factory level, passed to the sites in the form of an ARC job option => lowers priority
walltime request passed to the sites in the form of an ARC job option (tuned to the payload to be executed)
CMS: 100% MC
8-core (configurable) pilots sent to the sites
internal fair-share done at the factory level, 8-core pilots pull multiple MC and SC payloads
walltime request configured at the factory level (arbitrary number)
LHCb: 100% SC
1-core pilots sent to the sites
internal fair-share done at the factory level, 1-core pilots pull SC payloads
walltime request configured at CSCS. NOTE: this could also be done at the factory level (arbitrary number); see the xRSL sketch below
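For concreteness, a hedged xRSL sketch of the job options mentioned above (all values invented; attribute names should be checked against the ARC xRSL manual):

    &(executable="pilot_wrapper.sh")  (* hypothetical pilot script *)
     (count=8)                        (* ATLAS MC / CMS pilots; 1 for ATLAS SC and LHCb *)
     (walltime="24 hours")            (* ATLAS: tuned to the payload; CMS/LHCb: arbitrary factory value *)
     (priority=50)                    (* carries the factory-side fair-share, lowering the job priority *)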
Why does that happen?
Common problem at large shared sites: SC vs MC scheduling, where node fragmentation and backfill favour SC
SC slots are held for a long time due to the long configured walltime
SLURM is not an HTC scheduler; under the conditions above it is hard to judge whether it makes the right scheduling decisions according to its target settings (see the toy model below)
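A toy model of the fragmentation effect, with invented numbers (this is not SLURM's actual algorithm, just the intuition): SC jobs fit the small per-node holes immediately, while an 8-core pilot must wait for a full node to drain.

    # Toy model: why backfill on fragmented nodes favours SC over MC jobs.
    nodes = [6, 7, 5, 7]           # cores already busy on four 8-core nodes
    free = [8 - b for b in nodes]  # free cores per node: [2, 1, 3, 1]

    queue = [("MC", 8)] + [("SC", 1)] * 7   # one 8-core pilot, then SC jobs

    started, waiting = [], []
    for name, cores in queue:
        # an MC job needs all its cores on a single node; SC fits any hole
        target = next((i for i, f in enumerate(free) if f >= cores), None)
        if target is None:
            waiting.append((name, cores))   # blocks until a node drains
        else:
            free[target] -= cores
            started.append((name, cores))   # backfill slips the SC job in

    print("started:", started)   # all 7 SC jobs start immediately
    print("waiting:", waiting)   # the MC pilot waits; newly arriving SC jobs
                                 # keep re-filling the holes, so it can starve
                                 # unless the scheduler reserves cores for it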
Factors that have an impact:
SC vs MC imbalance (per user)
cputime imbalance (per user)
backfill (although this should favour shorter jobs, these are MC jobs)
ATLAS job nice-ing (currently turned off on Daint, but we need it; see the slurm.conf excerpt below)
Number of queued jobs (per user): is this balanced?
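For reference, the SLURM knobs where these factors enter; a hedged slurm.conf excerpt with invented weights (not CSCS's actual values):

    PriorityType=priority/multifactor
    PriorityDecayHalfLife=7-0        # how fast past cputime usage is forgotten
    PriorityWeightFairshare=100000   # per-account cputime imbalance enters here
    PriorityWeightAge=1000           # rewards time already spent in the queue
    PriorityWeightJobSize=0          # raising this would favour (larger) MC jobs
    SchedulerType=sched/backfill     # the backfill pass discussed above

ATLAS job nice-ing enters the same priority formula through the job's nice value (e.g. sbatch --nice).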
Impact on ATLAS
relative shares between experiments skewed to ATLAS' disadvantage
CPU delivery for ATLAS is really bumpy [1] [2]
jobs often wait too long and/or are cancelled by the experiment and re-directed somewhere else
this harms several workflows, specifically those that have higher (internal) priority
if we host data, we should have an adequate amount of resources available at any time for processing (~40% of the total as a baseline; see the sacctmgr sketch below)
we need to turn internal fair-share between the workloads back on
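A sketch of how the target relative shares could be pinned in SLURM's accounting database; the account names and the CMS/LHCb numbers are assumptions, only the ~40% ATLAS baseline comes from the points above:

    sacctmgr modify account where name=atlas set fairshare=40
    sacctmgr modify account where name=cms   set fairshare=30
    sacctmgr modify account where name=lhcb  set fairshare=30
    sshare -a    # verify effective shares against accumulated usage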
Options / proposals
Sites have in general invested large efforts in the past and cooked their own recipes (but I know of no shared site using SLURM)
Option 1:
Track and fix the fair-share. For this effort to be effective, we need access to the relevant debugging dashboards
Might be a labour-intensive task
Needs changes to the current sharing model, very likely with compromises between job length and the MC vs SC balance
Might not satisfy each experiment's requirements (e.g. long jobs, job nice-ing, etc.)
Suggestion: pack the nodes with single-core jobs first, rather than distributing them across the nodes (see the snippet after this list)
...
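A hedged slurm.conf sketch of the packing suggestion (to be validated on a test system before any production change):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    SchedulerType=sched/backfill
    # pack serial (1-core) jobs together at the end of the node list instead
    # of best-fit, which should reduce the fragmentation that blocks MC pilots:
    SchedulerParameters=pack_serial_at_end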
Option 2:
Split resources according to the fair-share quotas and allow each experiment to submit to the other partitions on a pre-emptable basis. NOTE: pre-emptable means the job is KILLED, not checkpointed (see the slurm.conf sketch below)
Each experiment has its own quota, and we delegate to them the claiming of any resources not used by another experiment
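A minimal slurm.conf sketch of this option; partition names, node lists and tiers are assumptions:

    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL               # pre-empted jobs are killed, not checkpointed
    # each experiment owns a partition sized to its fair-share quota...
    PartitionName=atlas     Nodes=... PriorityTier=2
    # ...and can opportunistically use a lower tier that may be pre-empted:
    PartitionName=atlas_low Nodes=... PriorityTier=1

The same pattern would repeat for cms and lhcb; the lower-tier partitions are where an experiment claims resources left unused by the others.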