
Fair-share Meeting on 2018-11-13

  • Date and time: 13 November 2018 14:00-15:00. (UTC+01:00) Belgrade, Bratislava, Budapest, Ljubljana, Prague
  • Place: CSCS Meeting Room 1st Floor (F1)
  • External link:

    Web Portal Address: https://vcmeeting.ethz.ch

    SCOPIA meeting ID 6708365

    SCOPIA via phone +41 43 244 89 30 | 6708365#

Agenda

  • Fair-share problem introduction (ATLAS)
    • Share issue first flagged on Piz Daint during the LHConCray commissioning project (1 year ago)
      • Not enough job pressure from CMS
      • Relative shares between ATLAS and LHCb skewed in favour of LHCb
    • Raised again at the f2f meeting on 21st June in ZH
    • In September we realised that the issue had also been present on Phoenix since ~May 2018
    • Hard to keep track of, since the monitoring dashboards cannot be accessed
    • Investigated with Dino, then discussed further f2f with Pablo & Dino
  • What is the fair-share problem?
    • ATLAS multi-core (MC) jobs wait too long in the queue compared to single-core (SC) jobs
      • ATLAS: ~80% MC, ~20% SC
        • 1 job=1 payload
        • internal fair-share done at the factory level, passed to the sites in the form of an ARC job option => lowers priority
        • walltime request passed to the sites in the form of an ARC job option (tuned to the payload to be executed)
      • CMS: 100% MC
        • 8-core (configurable) pilots sent to the sites
        • internal fair-share done at the factory level, 8-core pilots pull multiple MC and SC payloads
        • walltime request configured at the factory level (arbitrary number)
      • LHCb: 100% SC
        • 1-core pilots sent to the sites
        • internal fair-share done at the factory level, 1-core pilots pull SC payloads
        • walltime request configured at CSCS. NOTE: this can also be done at the factory level (arbitrary number)

  • Why does that happen?
    • A problem common to large shared sites, SC vs MC scheduling: node fragmentation and backfill favour SC
    • SC slots are held for a long time because of the long configured walltime
    • SLURM is not an HTC scheduler; under the conditions shown above it is hard to judge whether it makes the right scheduling decisions according to its target settings
    • Factors that have an impact:
      • SC vs MC imbalance (per user)
      • cputime imbalance (per user)
      • backfill (although this should favour shorter jobs, these are MC jobs)
      • ATLAS job nice-ing (currently turned off on Daint, although we need it)
      • Number of queued jobs (per user): is this balanced?
  • Impact on ATLAS
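
The node fragmentation effect listed under "Why does that happen?" can be illustrated with a toy model. This is only a sketch, not how SLURM actually schedules: it assumes a hypothetical cluster of 4 nodes with 8 cores each, that a job must fit entirely within one node, and that one long-running SC job already occupies a core on every node. The function name `start_job` and all numbers are made up for illustration.

```python
def start_job(free_cores_per_node, cores_needed):
    """Start a job on the first node with enough free cores.

    Assumption: a job cannot span nodes. Returns True if the job
    started, False if it has to stay in the queue.
    """
    for node, free in enumerate(free_cores_per_node):
        if free >= cores_needed:
            free_cores_per_node[node] -= cores_needed
            return True
    return False


# Hypothetical state: 4 nodes x 8 cores, one long SC job running on each node.
free = [7, 7, 7, 7]
total_free = sum(free)

# An 8-core MC job cannot start even though plenty of cores are free in total.
mc_started = start_job(free, 8)

# A 1-core SC job starts immediately and fragments the node further.
sc_started = start_job(free, 1)

print(f"free cores in total: {total_free}")
print(f"8-core MC job started: {mc_started}")
print(f"1-core SC job started: {sc_started}")
print(f"free cores per node afterwards: {free}")
```

In this toy state no MC job can start until an SC job on some node finishes, which is why long configured SC walltimes matter: every newly freed core can be taken by another SC job before a full node drains.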

Attendees

  • Roland
  • Christoph
  • Gianfranco
  • Thomas
  • Stefano
  • Nicholas
  • Dino
  • Gianni
  • Miguel

Minutes

  • item

Action items

  • item

Topic attachments

  • CHIPP_Job_Analysis.pdf (r1, 8426.7 K, uploaded 2018-11-13 15:08 by NickCardo)

Topic revision: r4 - 2018-11-13 - GianfrancoSciacca
 