Fair-share Meeting on 2018-11-13
- Date and time: 13 November 2018 14:00-15:00. (UTC+01:00) Belgrade, Bratislava, Budapest, Ljubljana, Prague
- Place: CSCS Meeting Room 1st Floor (F1)
- External link:
Web Portal Address: https://vcmeeting.ethz.ch
SCOPIA meeting ID 6708365
SCOPIA via phone +41 43 244 89 30 | 6708365#
Agenda
- Fair-share problem introduction (ATLAS)
- Share issue first flagged on Piz Daint during the LHConCray commissioning project (1 year ago)
- Not enough job pressure from CMS
- Relative shares between ATLAS and LHCb skewed in favour of LHCb
- Raised again at the f2f meeting on 21st June in ZH
- In September we realised that the issue has also been present on Phoenix since ~May 2018
- Hard to keep track of, since monitoring dashboards cannot be accessed
- Did some investigations with Dino and discussed further f2f with Pablo & Dino again
What is the fair-share problem?
- ATLAS multi-core (MC) jobs wait too long in the queue compared to single-core (SC) jobs
- ATLAS: ~80% MC, ~20% SC
- 1 job=1 payload
- internal fair-share done at the factory level, passed to the sites in the form of an ARC job option => lowers priority
- walltime request passed to the sites in the form of an ARC job option (tuned to the payload to be executed)
- CMS: 100% MC
- 8-core (configurable) pilots sent to the sites
- internal fair-share done at the factory level, 8-core pilots pull multiple MC and SC payloads
- walltime request configured at the factory level (arbitrary number)
- LHCb: 100% SC
- 1-core pilots sent to the sites
- internal fair-share done at the factory level, 1-core pilots pull SC payloads
- walltime request configured at CSCS. NOTE: this can be done at the factory level (arbitrary number)
Why does that happen?
- Problem common to large shared sites: in SC vs MC scheduling, node fragmentation and backfill favour SC
- SC slots are held for a long time because of the long configured walltime
- SLURM is not an HTC scheduler. In the conditions shown above it is hard to judge whether it makes the right scheduling decisions according to its target settings
- Factors that have an impact:
- SC vs MC imbalance (per user)
- cputime imbalance (per user)
- backfill (although this should favour shorter jobs, these are MC jobs)
- ATLAS job nice-ing (currently turned off on Daint, but we need it)
- Number of queued jobs (per user): is this balanced?
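The node fragmentation effect listed above can be illustrated with a toy model. This is only a sketch under stated assumptions: the node count, the cores per node and the two placement policies are hypothetical and do not reproduce the actual SLURM configuration on Daint or Phoenix. It counts how many 8-core slots remain once a few single-core jobs have been placed, either spread across the nodes or packed onto as few nodes as possible.

```python
def place_single_core(n_nodes, cores_per_node, n_jobs, pack):
    """Return the free cores per node after placing n_jobs single-core jobs.

    pack=True  : fill the first node that still has a free core (hypothetical policy)
    pack=False : always pick the node with the most free cores (hypothetical policy)
    """
    if n_jobs > n_nodes * cores_per_node:
        raise ValueError("more single-core jobs than available cores")
    free = [cores_per_node] * n_nodes
    for _ in range(n_jobs):
        if pack:
            node = next(i for i, f in enumerate(free) if f > 0)
        else:
            # max() returns the lowest index on ties, so jobs rotate over the nodes
            node = max(range(n_nodes), key=lambda i: free[i])
        free[node] -= 1
    return free


def multicore_slots(free, width=8):
    """Number of width-core jobs that still fit, each confined to a single node."""
    return sum(f // width for f in free)


# Toy cluster: 4 nodes with 8 cores each, 4 single-core jobs running.
spread = place_single_core(4, 8, 4, pack=False)
packed = place_single_core(4, 8, 4, pack=True)
print(spread, multicore_slots(spread))  # [7, 7, 7, 7] 0
print(packed, multicore_slots(packed))  # [4, 8, 8, 8] 3
```

In this toy case both placements leave 28 cores free, but spreading leaves no room for any 8-core job, while packing keeps three whole nodes available.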
Impact on ATLAS
- relative shares between experiments skewed to ATLAS's disadvantage
- CPU delivery for ATLAS is really bumpy [1] [2]
- jobs often wait too long and/or are cancelled by the experiment and re-directed somewhere else
- this harms several workflows, specifically those that have higher (internal) priority
- if we host data, we should have an adequate amount of resources available at any time for processing (~40% of the total as baseline)
- we need to turn back on internal fair-share between the workloads
- Options / proposals
- Sites have in general invested large efforts in the past and cooked their own recipes (but I know of no shared site using SLURM)
- Option 1:
- Track and fix the fair share. In order for such effort to be optimised, we need access to the relevant debugging dashboards
- Might be a labour-intensive task
- Needs changes to the current shared model, very likely compromises between job-length and MC vs SC balance
- Might not satisfy each experiment's requirements (e.g., long jobs, or job nice-ing, etc)
- Suggestion: pack the nodes with single core jobs first, rather than distributing them across the nodes
- ...
- Option 2:
- Split resources according to the fair-share quotas and allow each experiment to submit to the other partitions on a pre-emptable basis. NOTE: pre-emptable means the job is KILLED, not checkpointed
- Each experiment has its own quota, and we delegate to the experiments the claiming of any resources not used by another experiment
- Each experiment can shape their jobs as they wish
- ...
- Option 3:
- CSCS view
- Experiment views
- Next step(s)
- AOB
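The option that splits resources according to the fair-share quotas, with pre-emptable use of idle resources, can be sketched as a small accounting model. Everything here is an assumption for illustration: the total core count, the quota percentages for CMS and LHCb, the demand figures and the alphabetical order in which idle cores are lent out are all hypothetical, and the sketch does not model the SLURM partition or pre-emption settings themselves.

```python
def allocate(total_cores, quota_pct, demand):
    """Split cores by quota, then lend idle cores to experiments that want more.

    quota_pct : {experiment: integer percentage of the total} (hypothetical values)
    demand    : {experiment: cores currently requested}
    Returns (guaranteed, borrowed); borrowed cores are the pre-emptable ones.
    """
    owned = {e: total_cores * p // 100 for e, p in quota_pct.items()}
    guaranteed = {e: min(owned[e], demand[e]) for e in quota_pct}
    idle = sum(owned[e] - guaranteed[e] for e in quota_pct)
    borrowed = {}
    for e in sorted(quota_pct):  # lending order is an arbitrary choice
        borrowed[e] = min(demand[e] - guaranteed[e], idle)
        idle -= borrowed[e]
    return guaranteed, borrowed


def killed_on_reclaim(before, after):
    """Borrowed cores lost (jobs KILLED) when the owners reclaim their quota."""
    return {e: max(before[e] - after[e], 0) for e in before}


quotas = {"ATLAS": 40, "CMS": 40, "LHCb": 20}  # hypothetical split
g1, b1 = allocate(1000, quotas, {"ATLAS": 600, "CMS": 100, "LHCb": 500})
print(g1, b1)
# CMS now fills its own quota, so the cores it had left idle are reclaimed.
g2, b2 = allocate(1000, quotas, {"ATLAS": 600, "CMS": 400, "LHCb": 500})
print(killed_on_reclaim(b1, b2))
```

In the first call CMS uses only 100 of its 400 cores, so ATLAS borrows 200 and LHCb borrows 100. In the second call CMS claims its full quota and all 300 borrowed cores are lost, which is the cost of pre-emption without checkpointing.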
Attendees
- Roland
- Christoph
- Gianfranco
- Thomas
- Stefano
- Nicholas
- Dino
- Gianni
- Miguel
Minutes
Action items
Topic revision: r4 - 2018-11-13 - GianfrancoSciacca