Dear All,
Since the New Year is all about starting fresh with new challenges and ideas, we'd like to set you a challenge. It is inspired by two recent successes:
(1) Lukas and Alessandra have been able to run DL training (including hyperparameter optimisation) on GPUs in containerised jobs, with the resources accessed via grid machinery;
(2) Rui and Sau Lan have been able to run DL training on the GPUs on Summit, but by direct personal access, not via the grid.
The challenge is as follows: can we combine the two activities, so that we use grid machinery and containerised jobs to access the GPUs on a "difficult" HPC in the US (such as Summit) and run a user-submitted DL training application there? I have no idea how hard this is - presumably there will be all kinds of policy issues around running containers and network access on a machine such as Summit - but I think it is a good practical challenge that would let us start using these machines for real, in a way that could help physics analyses and combined performance.
Please let us know what you think of this plan. If you like it and think it is worth pursuing, please forward this to whoever you think would be interested. Perhaps we could have a talk or session on this at the Lancaster SW&C meeting in June?
Cheers,
James & Ale
###########################
### RTE Code ###
###########################
if [ "x$1" = "x0" ]; then
# RTE Stage 0
# You can do things on the ARC host at this stage.
# This is the right place to modify joboption_ variables.
# Note that this code already runs under the mapped user account!
# No root privileges!
export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
fi
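For context, ARC calls an RTE script with the stage number as its first argument: stage 0 runs on the ARC host before submission, stages 1 and 2 run on the worker node around the job payload. A minimal sketch of the full structure is below; only the stage-0 GPU line comes from the snippet above, while the CUDA_VISIBLE_DEVICES handling in stage 1 is an illustrative assumption, not part of the actual RTE.

```shell
#!/bin/bash
# Hypothetical ARC RTE skeleton (sketch; stage-1/2 bodies are assumptions).
stage="${1:-0}"
case "$stage" in
0)
    # Stage 0: on the ARC host, under the mapped user account (no root).
    # Ask the batch system (here Slurm-style options) for one GPU node.
    export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
    ;;
1)
    # Stage 1: on the worker node, just before the payload starts.
    # Assumption: expose the allocated GPU to the application if the
    # batch system has not already set this variable.
    export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"
    ;;
2)
    # Stage 2: on the worker node, after the payload; cleanup would go here.
    :
    ;;
esac
```

Invoked with no argument the sketch defaults to stage 0, so the GPU options end up appended to joboption_nodeproperty_0 exactly as in the fragment above.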