Dear All,
Since the New Year is all about starting fresh with new challenges and ideas, we'd like to set you a challenge. It is inspired by two recent successes:
(1) Lukas and Alessandra have been able to run DL training (including hyperparameter optimisation) on GPUs in containerised jobs, with the resources accessed via grid machinery;
(2) Rui and Sau Lan have been able to run DL training on the GPUs on Summit, but by direct personal access, not via the grid.
The challenge is as follows: can we combine the two activities, so that we use grid machinery and containerised jobs to access the GPUs on a "difficult" HPC in the US (such as Summit) and run a user-submitted DL training application there? I have no idea how hard this is - presumably there will be all kinds of policy issues around running containers and network access on a machine such as Summit - but I think it is a good practical challenge that would let us start using these machines for real, in a way that could help physics analyses and combined performance.
Please let us know what you think of this plan. If you like it and think it is worth pursuing, please forward this to whoever you think would be interested. Perhaps we could have a talk or session on this at the Lancaster SW&C meeting in June?
Cheers,
James & Ale
###########################
### RTE Code ###
###########################
if [ "x$1" = "x0" ]; then
# RTE Stage 0
# You can do things on the ARC host at this stage.
# This is the right place to modify joboption_ variables.
# Note that this code already runs under the mapped user account!
# No root privileges!
export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
fi
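For context, ARC calls an RTE script with the stage number as its first argument: stage 0 runs on the ARC host before submission, stages 1 and 2 run on the worker node around the job payload. A minimal sketch of the full structure is below; only the stage-0 GPU line comes from the snippet above, while the CUDA_VISIBLE_DEVICES handling in stage 1 is an illustrative assumption, not part of the actual RTE.

```shell
#!/bin/bash
# Hypothetical ARC RTE skeleton (sketch; stage-1/2 bodies are assumptions).
stage="${1:-0}"
case "$stage" in
0)
    # Stage 0: on the ARC host, under the mapped user account (no root).
    # Ask the batch system (here Slurm-style options) for one GPU node.
    export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
    ;;
1)
    # Stage 1: on the worker node, just before the payload starts.
    # Assumption: expose the allocated GPU to the application if the
    # batch system has not already set this variable.
    export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0}"
    ;;
2)
    # Stage 2: on the worker node, after the payload; cleanup would go here.
    :
    ;;
esac
```

Invoked with no argument the sketch defaults to stage 0, so the GPU options end up appended to joboption_nodeproperty_0 exactly as in the fragment above.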