ATLAS GPU Challenge on 2020-03-05

  • zoom link

Agenda

  • ATLAS description of the GPU challenge
    • Testing of containerised AI and DL workflows on GPU on the Grid

      • Activity by ADC (WMS) + software groups. Original call:

        Dear All,

        since the New Year is all about starting fresh with new challenges and ideas, we’d like to throw a challenge for you. This is inspired by some recent successes:
        (1) Lukas and Alessandra have been able to run DL training (including hyperparameter optimisation) on GPUs in containerised jobs, with the resources accessed via grid machinery
        (2) Rui and Sau Lan have been able to run DL training on Summit using the GPUs, but by direct personal access, not via the grid

        The challenge is as follows: can we combine the two activities such that we could use grid machinery and containerised jobs to access the GPUs on a “difficult” HPC in the US (such as Summit), to run some user-submitted DL training application? I have no idea how difficult this is - presumably there will be all kinds of policy issues related to running containers and network access on a machine such as Summit - but I think this is a good practical challenge to enable us to start using these machines for real, in a way that could help physics analyses and combined performance.

        Please let us know what you think of this plan. If you like it and think it is worth pursuing, please forward this to whoever you think would be interested. Perhaps we could have some kind of talk or session on this at the Lancaster SW&C meeting in June?

        Cheers,

        James & Ale

      • Aiming to have ~10 users successfully running small-scale workflows on the Grid (volunteered resources)
        • Make GPUs "popular" with interested users
        • Develop WMS integration solutions to fully support such workflows

    • Technical aspects, proposed starting configuration for CSCS
      • 1 or more GPUs available
      • easier with a dedicated partition; more complex (but doable) when using one of the existing partitions (might attract unwanted jobs onto the GPU node(s), e.g. ops tests)
      • ARC5 + RTE (*) is enough to request 1 GPU statically; with ARC6 the request can be adjusted dynamically
      • Preferred to start with a "generic node": no CVMFS, Singularity available; network connectivity unclear and hopefully not needed (or could go via squid proxies)
      • Container built off-site, staged in via ARC (see the wrapper sketch below)
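
      • Sketch of the in-job wrapper (a hypothetical run_train.sh; the image name train.sif and the script path inside the image are placeholders, not agreed values):

        #!/bin/bash
        # run_train.sh: run the DL training payload inside the container image
        # that ARC staged into the job directory. The --nv flag bind-mounts the
        # host NVIDIA driver stack and GPU devices into the container. No CVMFS
        # and no outbound network access are assumed, so the full software
        # stack ships inside the image.
        set -e
        singularity exec --nv ./train.sif python /opt/workload/train.py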

    • Request to CSCS (needed for the PanDA queue definition)
      • Name of partition to use to start testing (and maxtime)
      • Node specs (CPU, GPU, MEM)

    • (*) RTE for ARC5:

      ###########################
      ### RTE Code ###
      ###########################
      if [ "x$1" = "x0" ]; then
          # RTE Stage 0
          # You can do something on the ARC host at this stage.
          # This is the right place to modify joboption_ variables.
          # Note that this code already runs under the mapped user account:
          # no root privileges!
          # Append the SLURM options that place the job on a GPU node and
          # request one GPU.
          export joboption_nodeproperty_0="${joboption_nodeproperty_0} --constraint=gpu --gres=gpu:1"
      fi
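
    • Example user-side submission enabling the RTE (a sketch only: the RTE name ENV/GPU, the CE hostname, file names and limits are assumptions, not values agreed with CSCS; the numeric walltime is assumed to be in minutes):

      #!/bin/bash
      # Write a minimal xRSL job description that enables the GPU RTE defined
      # above and stages in the off-site-built image together with the wrapper
      # script sketched earlier, then submit it with the ARC client.
      cat > gpu-test.xrsl <<'EOF'
      &(executable="run_train.sh")
       (inputFiles=("train.sif" "")("run_train.sh" ""))
       (runTimeEnvironment="ENV/GPU")
       (count=1)
       (walltime="120")
       (memory=4000)
       (stdout="stdout.txt")
       (stderr="stderr.txt")
      EOF
      arcsub -c arc.example.cscs.ch gpu-test.xrsl   # placeholder CE hostname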

  • Discussion

Attendees

  • person

Minutes

  • item

Action items

  • item