Meeting on using GPUs and on optimizing Central Storage costs at CSCS

Follow-up discussion on the use of accelerators (GPUs, FPGAs, ...) and on reducing or eliminating the need for central storage. Neither topic can be solved by us alone, so please discuss these topics with your organization before joining the meeting and let us know what their input is.

Mauro/CMS

  • On storage, the community is moving in the data-lake direction: a few large sites with big storage.
  • Disconnecting storage from compute does seem technically possible (given sufficient bandwidth). Politically, it would take an effort to align ourselves with the other sites; one community cannot do it on its own.
  • On the GPU side, CMS is also moving in that direction. One example: the effort to move the HLT (high-level trigger) from CPU to GPU is considerable; a CMS team (Patatrack) is driving this work.
  • Part of the event reconstruction has been ported; other parts are further behind. The main issue is CUDA coding expertise, which is limited (see the sketch after this list). Running reconstruction on GPUs seems reachable, but so far there have only been baby steps: not there yet.
  • Geant is CPU-only (some efforts to vectorize it are under way, e.g. in GeantV).
  • The option for us to offer GPUs would be via project-specific requests. These would have to come on demand and only for the duration of the project.
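
For illustration only, a minimal CUDA sketch of the kind of data-parallel step that HLT/reconstruction ports offload to a GPU. Nothing below is actual CMSSW or Patatrack code; the hit-calibration kernel and all names in it are invented for this example.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel: one thread calibrates one detector hit.
    // This one-thread-per-datum pattern is what makes such
    // reconstruction steps attractive for GPUs.
    __global__ void calibrateHits(const float* raw, float* calibrated,
                                  float pedestal, float gain, int nHits) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nHits)
            calibrated[i] = (raw[i] - pedestal) * gain;
    }

    int main() {
        const int nHits = 1 << 20;
        float *raw = nullptr, *cal = nullptr;
        cudaMallocManaged(&raw, nHits * sizeof(float));  // unified memory
        cudaMallocManaged(&cal, nHits * sizeof(float));
        for (int i = 0; i < nHits; ++i) raw[i] = 100.0f + (i % 7);  // fake data

        const int threads = 256;
        const int blocks = (nHits + threads - 1) / threads;
        calibrateHits<<<blocks, threads>>>(raw, cal, 2.5f, 0.98f, nHits);
        cudaDeviceSynchronize();

        printf("hit 0: raw=%.1f calibrated=%.2f\n", raw[0], cal[0]);
        cudaFree(raw);
        cudaFree(cal);
        return 0;
    }

Even this trivial kernel requires reasoning about thread grids and memory management, which illustrates the expertise barrier mentioned above.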

Gianfranco/ATLAS

  • Both topics were covered at the last WLCG workshop (joint with the HSF workshop; links below).
  • On the GPU side, ATLAS specifically has only a limited effort in this direction. 60-70% of ATLAS CPU time is spent on event generation and Geant simulation, which are not portable to GPUs.
  • So far only a handful of sites have made a small number of GPUs available on the batch system.
  • One of the problems is accounting, which is not yet prepared to handle GPUs. Sites are not motivated if their GPU contributions are not accounted for. This is now on the table and will probably progress at these workshops (see the sketch after this list).
  • The ATLAS HLT is also moving towards GPUs, but is not yet as advanced as CMS. FPGAs are mentioned very little, far less than GPUs.
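
As a purely illustrative sketch of what site-level GPU accounting could look like, assuming a Slurm batch system (the node name and device path are placeholders, not CSCS configuration): Slurm can declare GPUs as a trackable resource (TRES), so that per-job GPU allocations end up in the accounting database.

    # slurm.conf (fragment): declare GPUs and record them in accounting
    GresTypes=gpu
    AccountingStorageTRES=gres/gpu

    # gres.conf on a GPU node (placeholder node name and device path)
    NodeName=gpu01 Name=gpu File=/dev/nvidia0

    # Per-job GPU allocations can then be reported with sacct, e.g.:
    #   sacct -X --format=JobID,AllocTRES%40,Elapsed

Getting such site-level numbers into the WLCG-wide accounting is exactly the open question raised above.
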
Concerning storage, the situation is more concrete:
  • There is a WLCG project called DOMA to analyze the storage models, including volumes and costs. Its activity on data access and content delivery may be worth looking at. ATLAS is already doing ARC caching.
  • Monolithic storage seems to be the least popular option: a data lake would be a collection of storage platforms, and sites would then choose which storage tier they want to sit in.
  • To relieve the dCache costs, the only thing Gianfranco can think of is to have the disks managed remotely (e.g. by NDGF), relieving the admin pressure. Another possibility is to federate storage within Switzerland. Also, Italy has a DPM distributed across 4 sites, and France is commissioning a similar setup for production.
  • Sites have the option to run without storage, but then the storage has to come from someone else... The experiments made clear that the message to the funding agencies must NOT be that storage is not needed, because it actually is.
  • One problem with the cache space is that it is not yet pledged, but people are aware of this, so it might change. If the model evolves to treat cache storage as real storage and to use it more actively, then eventually one could reduce the permanent storage (the caches providing extra redundancy) and therefore reduce costs.
  • If we get rid of the storage we would also stop being a NUCLEUS site.
  • Gianfranco suggests we set foot in one of the DOMA activities by considering a low-effort storage test bed that uses federated storage components and the ARC caches we already operate (a configuration sketch follows this list).
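
As a hedged sketch of what the caching side of such a test bed could build on, assuming an ARC 6-style arc.conf (the path and watermarks are placeholders, not our production values):

    # arc.conf (fragment): enable the A-REX data cache
    [arex/cache]
    cachedir=/scratch/arc/cache

    # Clean automatically between high/low watermarks (% of filesystem used)
    [arex/cache/cleaner]
    cachesize=80 70

The federated-storage components of the test bed would remain the open part to define within the DOMA activity.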

Roland / LHCb

  • Not much code runs on GPUs, for the same reasons as in ATLAS and CMS.
  • From the LHCb point of view, the storage can be anywhere. But CSCS storage is actual storage, not cache.
  • Funding agencies have issues funding hardware that is used by other countries: essentially, we have to provide our own.

Varia

  • Pablo notes that federating the storage (having it operated by someone else) to save manpower is an all-or-nothing proposition: if we still need to operate "some" storage, the manpower effort is very similar. And, in any case, the storage costs (servers, disks, ...) will still be there.
  • Pablo reiterates the possibility of setting the compute/storage balance (mostly) as needed, and asks the VOs to double-check their assigned quantities to see whether they are appropriate.
  • As for the chicken-and-egg problem of GPUs at sites, our option could be to deploy one or two GPUs. The problem is that having GPUs sitting idle will not really help.
  • Christoph notes that we should let the VOs drive these efforts and be supportive if need be. He also suggests that, at least for now, we should think about how to optimize the resources we have rather than starting R&D activities.

Useful links:

Attendees

  • CSCS: Nicholas, Dino, Pablo
  • CMS: Christoph, Mauro, Vinzenz, Nina
  • ATLAS: Gianfranco
  • LHCb: Roland

