
ATLAS GPU Challenge on 2020-03-05

###############################
## CSCS Meeting room 3rd floor (LCA)

Join by Zoom meeting:
https://cscs.zoom.us/my/cscsmr3f
Host key: 636363

Join by phone:
Switzerland: +41 43 210 70 42 | Meeting ID 9161082630
International phone access numbers: https://zoom.us/u/amdNZ9YRF

Join by SIP:
9161082630@zoomcrc.com

Agenda

ATLAS description of the GPU challenge

Testing of containerised AI and DL workflows on GPU on the Grid
- Activity by ADC (WMS)+software groups. Original call:
  
  Dear All,
  
  since the New Year is all about starting fresh with new challenges and ideas, we’d like to throw a challenge for you. This is inspired by some recent successes:
  (1) Lukas and Alessandra have been able to run DL training (including hyperparameter optimisation) on GPUs in containerised jobs, with the resources accessed via grid machinery
  (2) Rui and Sau Lan have been able to run DL training on Summit using the GPUs, but by direct personal access, not via the grid
  
  The challenge is as follows: can we combine the two activities such that we could use grid machinery and containerised jobs to access the GPUs on a “difficult” HPC in the US (such as Summit), to run some user-submitted DL training application? I have no idea how difficult this is - presumably there will be all kinds of policy issues related to running containers and network access on a machine such as Summit - but I think this is a good practical challenge to enable us to start using these machines for real, in a way that could help physics analyses and combined performance.
  
  Please let us know what you think of this plan. If you like it and think it is worth pursuing, please forward this to whoever you think would be interested. Perhaps we could have some kind of talk or session on this at the Lancaster SW&C meeting in June?
  
  Cheers,
  
  James & Ale
- Aiming at having ~10 users successfully running small-scale workflows on the Grid (volunteered resources)
  - Make GPUs "popular" with interested users
  - Develop WMS integration solutions to fully support such workflows
Technical aspects, proposed starting configuration for CSCS
- 1 or more GPUs available
- Easier with a dedicated partition; more complex (but doable) with one of the existing partitions, which might attract unwanted jobs to the GPU node(s), e.g. ops tests.
- ARC5 with an RTE (*) is fine for requesting 1 GPU statically. With ARC6 the request can be adjusted dynamically.
- Preferred to start with a "generic node": no CVMFS, singularity, network connectivity not clear, hopefully not needed (or could be via squid proxies).
- Container built off-site, staged in via ARC.
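As an illustration of the last two points, the payload could be launched from the staged-in image roughly as sketched below. This is a sketch under assumptions, not the agreed setup: it assumes Singularity is the container runtime available on the node, and the image name `dl_training.sif`, the script `train.py` and the helper `build_payload_cmd` are all hypothetical. The helper only prints the command line, so it can be tried on a machine without Singularity or a GPU.

```shell
#!/bin/bash
# Sketch only: image and payload names are hypothetical placeholders.
# Build (and print) the command line that would run a payload inside a
# Singularity image that ARC has staged into the job's working directory.
build_payload_cmd() {
    local image="$1"; shift
    # --nv asks Singularity to expose the host NVIDIA driver and devices
    # inside the container.
    printf 'singularity exec --nv %s' "$image"
    # Append each payload argument, separated by a single space.
    printf ' %s' "$@"
    printf '\n'
}

build_payload_cmd ./dl_training.sif python3 train.py
```

Since the image is built off-site and staged in, nothing in this command depends on CVMFS or on outbound network access from the node.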
Request to CSCS (needed for the PanDA queue definition)
- Name of partition to use to start testing (and maxtime)
- Node specs (CPU, GPU, MEM)
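The two items requested above can usually be read off the batch system directly. A minimal sketch, assuming the CSCS partition is managed by Slurm (the `--constraint`/`--gres` options in the RTE suggest this); the function name `partition_specs` and the partition name `gpu_test` are hypothetical.

```shell
#!/bin/bash
# Sketch only: the partition name passed in is a placeholder, not a real
# CSCS partition.
# Prints: partition | max walltime | CPUs per node | memory per node (MB) | generic resources (e.g. GPUs)
partition_specs() {
    local partition="$1"
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found: run this on a Slurm login node" >&2
        return 1
    fi
    sinfo --partition="$partition" --noheader --format='%P|%l|%c|%m|%G'
}

# Example call (hypothetical partition name):
# partition_specs gpu_test
```

The resulting line gives the values needed for the PanDA queue definition: partition name, maxtime, and per-node CPU, memory and GPU counts.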
(\*) RTE for ARC5:
###########################
### RTE Code ###
###########################
if [ "x$1" = "x0" ]; then
    # RTE stage 0
    # You can do something on the ARC host at this stage.
    # This is the right place to modify joboption_ variables.
    # Note that this code already runs under the mapped user account.
    # No root privileges!
    # Append the GPU request to the node properties passed on to the batch system.
    export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
fi
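
The stage 0 logic can be exercised outside ARC to check what ends up in the variable. A minimal sketch, assuming ARC invokes the RTE with the stage number as its first argument, as the `$1` test in the RTE implies; the wrapper name `gpu_rte` and the pre-existing value `--account=atlas` are hypothetical.

```shell
#!/bin/bash
# Sketch only: gpu_rte is a hypothetical wrapper around the RTE body so
# that it can be called with different stage numbers.
gpu_rte() {
    if [ "x$1" = "x0" ]; then
        # Stage 0: append the static single-GPU request.
        export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
    fi
}

# Hypothetical value already set before the RTE runs.
joboption_nodeproperty_0="--account=atlas"

gpu_rte 0
echo "after stage 0: ${joboption_nodeproperty_0}"

# Any other stage leaves the variable untouched.
gpu_rte 1
echo "after stage 1: ${joboption_nodeproperty_0}"
```

Both lines should show `--account=atlas --constraint=gpu --gres=gpu:1`, i.e. the GPU request is added once, at stage 0 only. Because the request is hard-coded, every job using this RTE asks for exactly one GPU, which matches the "statically" remark for ARC5.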

Discussion

CSCS questions to ATLAS:

- Users (who will be submitting the jobs?)
  - Same as now. We only use one central submission engine (aCT/harvester), with either a prod or an analysis proxy for each job. The actual users behind the payload (analysis, in this case) are encoded in the job name.
- Contact person from each side (project coordination)
  - There is no formal project. This is an activity in the frame of the WMS working group. ATLAS contact in CH for WMS is myself. Please clarify whether by proxy or not.
- Process for communicating the results
  - Not sure, I suppose the CH ops monthly meetings? Within ATLAS there is a weekly WMS meeting and regular internal channels (ADC weekly, TCB weekly, quarterly S&C weeks, yearly TIM, etc.)
  - [Mauro] I understood later that the question was whether the results will be published or otherwise made public in talks.
- Definition of the success criteria
  - As for any other ATLAS prod and analysis workload.
  - [Mauro] I understood later that the question points to the definition of staged goals. (I would see something like building/deploying the container, job submission, etc., rather than the actual NN training results.)
- Monitoring
  - bigpanda.cern.ch, ATLAS monit-grafana.cern.ch
- Impact on CHIPP workload (e.g. CVMFS and Scratch would be shared)
  - Not sure, but I cannot think of anything special. We’ll find out as we go. Actually, we aim at not using CVMFS at all; hopefully we will succeed.
- Workload size and entry point (e.g. special ARC, queue size)
  - Workload is not unique (as for all analysis jobs). Looking up a random job running e.g. at Manchester now, I see this:
    - Dataset summary: input: 31, size: 290.48(MB); log: 1
  - Existing ARC is fine. Existing queue size is also ok. A separate queue is probably desirable, but not mandatory.