Share issue first flagged on Piz Daint during the LHConCray commissioning project (1 year ago)
Not enough job pressure from CMS
Relative shares between ATLAS and LHCb skewed in favour of LHCb
Raised again at the f2f meeting on 21st June in ZH
In September we realised that the issue had been showing up on Phoenix too since ~May 2018
Hard to keep track of, since monitoring dashboards cannot be accessed
Did some investigations with Dino, and discussed further f2f with Pablo & Dino
What is the fair-share problem?
ATLAS MultiCore jobs wait too long in the queue compared to single core jobs
ATLAS: ~80% MC, ~20% SC
1 job=1 payload
internal fair-share done at the factory level, passed to the sites in the form of an ARC job option => lowers priority
walltime request passed to the sites in the form of an ARC job option (tuned to the payload to be executed)
CMS: 100% MC
8-core (configurable) pilots sent to the sites
internal fair-share done at the factory level, 8-core pilots pull multiple MC and SC payloads
walltime request configured at the factory level (arbitrary number)
LHCb: 100% SC
1-core pilots sent to the sites
internal fair-share done at the factory level, 1-core pilots pull SC payloads
walltime request configured at CSCS. NOTE: this could also be done at the factory level (arbitrary number); see the xRSL sketch below
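For concreteness, a hedged xRSL sketch of the job options mentioned above (all values invented; attribute names should be checked against the ARC xRSL manual):

    &(executable="pilot_wrapper.sh")  (* hypothetical pilot script *)
     (count=8)                        (* ATLAS MC / CMS pilots; 1 for ATLAS SC and LHCb *)
     (walltime="24 hours")            (* ATLAS: tuned to the payload; CMS/LHCb: arbitrary factory value *)
     (priority=50)                    (* carries the factory-side fair-share, lowering the job priority *)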
Why does that happen?
Common problem at large shared sites: SC vs MC scheduling, where node fragmentation and backfill favour SC
SC slots are held for a long time due to the long configured walltime
SLURM is not an HTC scheduler; under the conditions above it is hard to judge whether it makes the right scheduling decisions according to its target settings (see the toy model below)
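A toy model of the fragmentation effect, with invented numbers (this is not SLURM's actual algorithm, just the intuition): SC jobs fit the small per-node holes immediately, while an 8-core pilot must wait for a full node to drain.

    # Toy model: why backfill on fragmented nodes favours SC over MC jobs.
    nodes = [6, 7, 5, 7]           # cores already busy on four 8-core nodes
    free = [8 - b for b in nodes]  # free cores per node: [2, 1, 3, 1]

    queue = [("MC", 8)] + [("SC", 1)] * 7   # one 8-core pilot, then SC jobs

    started, waiting = [], []
    for name, cores in queue:
        # an MC job needs all its cores on a single node; SC fits any hole
        target = next((i for i, f in enumerate(free) if f >= cores), None)
        if target is None:
            waiting.append((name, cores))   # blocks until a node drains
        else:
            free[target] -= cores
            started.append((name, cores))   # backfill slips the SC job in

    print("started:", started)   # all 7 SC jobs start immediately
    print("waiting:", waiting)   # the MC pilot waits; newly arriving SC jobs
                                 # keep re-filling the holes, so it can starve
                                 # unless the scheduler reserves cores for it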
Factors that have an impact:
SC vs MC imbalance (per user)
cputime imbalance (per user)
backfill (although this should favour shorter jobs, these are MC jobs)
ATLAS job nice-ing (currently turned off on Daint, but we need it; see the slurm.conf excerpt below)
Number of queued jobs (per user): is this balanced?
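For reference, the SLURM knobs where these factors enter; a hedged slurm.conf excerpt with invented weights (not CSCS's actual values):

    PriorityType=priority/multifactor
    PriorityDecayHalfLife=7-0        # how fast past cputime usage is forgotten
    PriorityWeightFairshare=100000   # per-account cputime imbalance enters here
    PriorityWeightAge=1000           # rewards time already spent in the queue
    PriorityWeightJobSize=0          # raising this would favour (larger) MC jobs
    SchedulerType=sched/backfill     # the backfill pass discussed above

ATLAS job nice-ing enters the same priority formula through the job's nice value (e.g. sbatch --nice).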
Impact on ATLAS
relative shares between experiments skewed to ATLAS' disadvantage
CPU delivery for ATLAS is really bumpy [1] [2]
jobs often wait too long and/or are cancelled by the experiment and re-directed somewhere else
this harms several workflows, specifically those that have higher (internal) priority
if we host data, we should have an adequate amount of resources available at any time for processing (~40% of the total as a baseline; see the sacctmgr sketch below)
we need to turn internal fair-share between the workloads back on
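A sketch of how the target relative shares could be pinned in SLURM's accounting database; the account names and the CMS/LHCb numbers are assumptions, only the ~40% ATLAS baseline comes from the points above:

    sacctmgr modify account where name=atlas set fairshare=40
    sacctmgr modify account where name=cms   set fairshare=30
    sacctmgr modify account where name=lhcb  set fairshare=30
    sshare -a    # verify effective shares against accumulated usage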
Options / proposals
Sites have in general invested large efforts in the past and cooked their own recipes (but I know of no shared site using SLURM)
Option 1:
Track and fix the fair-share. For this effort to be effective, we need access to the relevant debugging dashboards
Might be a labour-intensive task
Needs changes to the current sharing model, very likely with compromises between job length and the MC vs SC balance
Might not satisfy each experiment's requirements (e.g. long jobs, job nice-ing, etc.)
Suggestion: pack the nodes with single-core jobs first, rather than distributing them across the nodes (see the snippet after this list)
...
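A hedged slurm.conf sketch of the packing suggestion (to be validated on a test system before any production change):

    SelectType=select/cons_res
    SelectTypeParameters=CR_Core
    SchedulerType=sched/backfill
    # pack serial (1-core) jobs together at the end of the node list instead
    # of best-fit, which should reduce the fragmentation that blocks MC pilots:
    SchedulerParameters=pack_serial_at_end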
Option 2:
Split resources according to the fair-share quotas and allow each experiment to submit to the other partitions on a pre-emptable basis. NOTE: pre-emptable means the job is KILLED, not checkpointed (see the slurm.conf sketch below)
Each experiment has its own quota, and we delegate to them the claiming of any resources not used by another experiment
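A minimal slurm.conf sketch of this option; partition names, node lists and tiers are assumptions:

    PreemptType=preempt/partition_prio
    PreemptMode=CANCEL               # pre-empted jobs are killed, not checkpointed
    # each experiment owns a partition sized to its fair-share quota...
    PartitionName=atlas     Nodes=... PriorityTier=2
    # ...and can opportunistically use a lower tier that may be pre-empted:
    PartitionName=atlas_low Nodes=... PriorityTier=1

The same pattern would repeat for cms and lhcb; the lower-tier partitions are where an experiment claims resources left unused by the others.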