<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

ATLAS resources federation

* Zoom link: https://ethz.zoom.us/j/91556179990


Plan to federate two ATLAS sites into one.

CSCS-LCG2 (dCache), UNIBE-LHEP (DPM) => CHIPP-CH (DPM)

Note: the CSCS storage will physically remain at CSCS

Step 1 storage:

  • Drain internally one dCache "storage unit" at CSCS, re-install it as a DPM "storage unit" and attach it to the Bern DPM head node
  • Operate in this mode for a minimum of 6-8 weeks
  • If no blocking issues are discovered: transition all storage pools from dCache to the DPM head node in Bern.
  • Strategy to be defined; the hope is to drain and re-install (most of) them as DPM one by one internally, if possible, to reduce the load on DDM ops. This will shrink the size of the CSCS storage and increase that of the Bern storage.
  • At some point during this procedure, move CSCS Panda queues to the Bern storage
  • Make CSCS storage RO and finalise its draining and pool transition to the Bern storage
Step 2 create the new ATLAS site:
  • Transition the Bern DPM endpoints to the new site
Step 3 Panda queues
  • Move the Panda sites CSCS-LCG2 and UNIBE-LHEP (or create new ones) to the ATLAS site CHIPP-CH

Technical meeting with the following goals:

Understand the Federation layout

Understand the CSCS storage layout for dCache and how to map it to DPM
Understand the network layout between the two sites

  • https://traffic.lan.switch.ch/vip/swiss-map/index.html
  • https://traffic.lan.switch.ch/vip/international-map/
  • Direct link between Bern and Lugano with 100G capacity, currently limited to 40G at the Bern border

  • Bern SE to CSCS SE path:
    [root@dpm ~]# traceroute se33.cscs.ch
    traceroute to se33.cscs.ch (148.187.19.183), 30 hops max, 60 byte packets
     1  beethoven-67.unibe.ch (130.92.67.1)  0.342 ms  0.876 ms  0.276 ms
     2  castorfw-inside.unibe.ch (130.92.0.36)  0.825 ms  0.815 ms  0.849 ms
     3  castor-inside.unibe.ch (130.92.244.3)  1.008 ms  0.997 ms  1.366 ms
     4  swiBE3-40GE-0-1-0-0-0.switch.ch (195.176.3.1)  1.317 ms  1.305 ms  1.254 ms
     5  swiLG1-100GE-0-0-0-3.switch.ch (130.59.36.102)  4.201 ms  4.200 ms  4.157 ms
     6  100G-C-IPv4.cscs.ch (148.187.0.10)  5.976 ms  3.717 ms  3.782 ms
     7  se33.cscs.ch (148.187.19.183)  3.542 ms  3.509 ms  3.306 ms

  • Bern SE to CSCS CE path (probably behind firewall):
    [root@dpm ~]# traceroute arc04.lcg.cscs.ch
    traceroute to arc04.lcg.cscs.ch (148.187.19.136), 30 hops max, 60 byte packets
     1  beethoven-67.unibe.ch (130.92.67.1)  0.317 ms  0.310 ms  0.271 ms
     2  castorfw-inside.unibe.ch (130.92.0.36)  0.830 ms  0.831 ms  0.930 ms
     3  castor-inside.unibe.ch (130.92.244.3)  1.464 ms  1.551 ms  1.527 ms
     4  swiBE3-40GE-0-1-0-0-0.switch.ch (195.176.3.1)  1.110 ms  1.547 ms  1.049 ms
     5  swiLG1-100GE-0-0-0-3.switch.ch (130.59.36.102)  4.303 ms  4.297 ms  4.271 ms
     6  * * *
  • Bern SE to the outside, e.g. NDGF:
    [root@dpm ~]# traceroute piggy.ndgf.org
    traceroute to piggy.ndgf.org (109.105.124.142), 30 hops max, 60 byte packets
     1  beethoven-67.unibe.ch (130.92.67.1)  0.362 ms  0.314 ms  0.283 ms
     2  castorfw-inside.unibe.ch (130.92.0.36)  0.234 ms  0.258 ms  0.189 ms
     3  castor-inside.unibe.ch (130.92.244.3)  0.824 ms  0.810 ms  0.882 ms
     4  swiBE3-40GE-0-1-0-0-0.switch.ch (195.176.3.1)  0.703 ms  1.678 ms  1.675 ms
     5  swiCE4-100GE-0-0-0-2.switch.ch (130.59.37.146)  3.687 ms  3.363 ms  3.693 ms
     6  swiCE1-B4.switch.ch (130.59.36.69)  3.650 ms  5.370 ms  5.344 ms
     7  switch.mx1.gen.ch.geant.net (62.40.124.21)  3.108 ms  3.107 ms  3.103 ms
     8  ae6.mx1.par.fr.geant.net (62.40.98.183)  12.194 ms  12.023 ms  11.990 ms
     9  ae5.mx1.lon2.uk.geant.net (62.40.98.178)  16.713 ms  16.657 ms  16.860 ms
    10  ae6.mx1.lon.uk.geant.net (62.40.98.36)  17.574 ms  17.694 ms  17.381 ms
    11  nordunet-gw.mx1.lon.uk.geant.net (62.40.124.130)  17.545 ms  17.531 ms  17.517 ms
    12  dk-uni.nordu.net (109.105.97.126)  37.190 ms  37.187 ms  36.999 ms
    13  dk-ore.nordu.net (109.105.97.132)  37.446 ms  38.076 ms  37.996 ms
    14  dk-ore2.nordu.net (109.105.102.119)  53.793 ms  49.192 ms  49.131 ms
    15  piggy.ndgf.org (109.105.124.142)  37.613 ms  37.508 ms  37.412 ms
  • Work in progress: move the Bern ARC CEs and SE nodes into the DMZ

  • CSCS SE to Bern SE path:
    ...
  • CSCS SE to Bern path:
    ...
  • CSCS SE to outside, e.g. NDGF:
    ...

Understand the current and expected network rates for SE to compute and WAN
  • Network Run-2:
    • Assume analysis on an HT-Core (job-slot) consumes 1.2 MBytes/sec
    • Implies job-slots need that level of network bandwidth to storage
    • WAN access to remote storage at 20% (ATLAS avg now)
      • Nominal Tier-2: 5000 job slots => 6 GBytes/sec, WAN 9.6 Gbits/sec
      • Leadership Tier-2: 10000 job slots => 12 GBytes/sec, WAN 19.2 Gbits/sec

  • NOTE: Run-3 will have 3-4 times the data; either the number of cores must increase or the average software throughput must improve by that factor

  • Network Run-3:
    • Add a burst capability
      • Nominal Tier-2 WAN: 9.6 Gbps x 3 = 28.8 Gbps => 40G link
      • Leadership Tier-2: 10000 job slots => 19.2 Gbps x 3 = 57.6 Gbps => 80G link
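The sizing figures above follow from three stated assumptions: 1.2 MBytes/sec of storage traffic per HT-core job slot, a 20% WAN fraction for remote access, and a x3 burst factor for Run-3. A short script (function name and structure are illustrative, not from the source) reproduces the numbers:

```python
def tier2_bandwidth(job_slots, mb_per_slot=1.2, wan_fraction=0.20, burst=3):
    """Estimate Tier-2 bandwidth needs from the assumptions above.

    mb_per_slot  : MBytes/sec each job slot reads from storage (1.2 assumed)
    wan_fraction : share of access going to remote storage over the WAN (20%)
    burst        : Run-3 burst multiplier on the WAN rate (x3)
    """
    se_gbytes = job_slots * mb_per_slot / 1000   # GBytes/sec, slots -> local SE
    wan_gbits = se_gbytes * wan_fraction * 8     # Gbits/sec over the WAN
    return se_gbytes, wan_gbits, wan_gbits * burst

# Nominal Tier-2: 5000 slots -> 6 GB/s SE, 9.6 Gb/s WAN, 28.8 Gb/s burst
print(tier2_bandwidth(5000))
# Leadership Tier-2: 10000 slots -> 12 GB/s SE, 19.2 Gb/s WAN, 57.6 Gb/s burst
print(tier2_bandwidth(10000))
```

The burst figures motivate the 40G and 80G link recommendations above.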
Plan the first step concretely

  • Drain internally one dCache "storage unit" at CSCS, re-install it as a DPM "storage unit" and attach it to the Bern DPM head node

Lay out a tentative plan for the following steps
  • AOB


  • MEETING NOTES (*)

    • The federation plan will proceed in stages. The first stage is to make one storage server from the CSCS “lab” available in a dedicated VLAN, so that Gianfranco can have root access without impacting security. CSCS provides a basic CentOS 7 OS; Gianfranco provides the DPM stack and configuration. The server will be attached to the Bern DPM head node and put in production for a minimum of 6-8 weeks, in order to discover and correct any issues encountered. During this time, a strategy for migrating the rest of the storage will be worked out, with feedback to ATLAS DDM, who will provide additional manpower
    • Action on Gianfranco: create RT ticket
    • Possible strategies: incremental migration of “storage blocks” by internal draining and re-install, or mass-data migration performed by ATLAS (requires some additional buffer disk space at CSCS), or a combination of the two. Mass-data migration of 1.5-1.8 PB is expected to last a few months, depending on the size of the additional buffer; three months is a realistic estimate
    • ATLAS understands that its Swiss storage might not be at full size and/or performance during the migration period. No impact on CMS and LHCb in the incremental-migration scenario (when a subset of disks is removed, so are the corresponding data and r/w ops). Transient positive impact in the mass-data-migration case (data and r/w ops are removed while the disks remain attached to the storage)
    • If no blocking issues are discovered: carry on with the full migration. Manpower foreseen: ATLAS DDM (intensive), ATLAS CH (fair), CSCS (minor); all of it is considered operational effort
    • Manpower foreseen after the federation is implemented: ATLAS DDM none (beyond routine operational support); ATLAS CH a small additional effort to support and maintain/upgrade the remote nodes, handle GGUS tickets, etc.; CSCS slightly less, no longer needing to support ATLAS requests on dCache and the GGUS storage-related issues, which will be Bern's responsibility. No additional monitoring is needed at CSCS (the disks remain the same as before). Bern expects to absorb the additional load within their operational effort (T.B.C.).
    • Good communication will be maintained (as it is now) for routine support of the underlying hardware (e.g. outages, communication of downtimes, etc).
    • Target for end of transition to a full federated site (including compute): end of calendar year.
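The months-scale estimate for the mass-data migration can be sanity-checked by noting that transfer time scales inversely with the sustained rate. The 1.5-1.8 PB volume is from the notes above; the sustained rates below are illustrative assumptions, well under the 40-100 Gb/s link capacities, since real migration campaigns rarely run near line rate:

```python
def migration_days(petabytes, sustained_gbits):
    """Days of pure transfer to move `petabytes` at a sustained rate in Gbits/sec."""
    seconds = petabytes * 1e15 * 8 / (sustained_gbits * 1e9)
    return seconds / 86400

# 1.8 PB at a sustained 10 Gb/s is ~17 days of pure transfer; at a more
# realistic 1-2 Gb/s sustained it stretches to roughly 3 months, consistent
# with the estimate in the notes.
for rate in (10, 2, 1):
    print(f"{rate:>2} Gb/s sustained: {migration_days(1.8, rate):6.1f} days")
```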

    Additional technical details clarified:
    • No changes will occur in the GOCDB site definitions
    • No changes will occur in the ARC configuration, nor on Piz Daint (which is de-coupled from the storage by the ARC data staging and cache)
    • At some point (to be decided) during the migration, ATLAS will switch its storage configuration so that it uses the Bern DPM. From that point on, there will be no more r/w from ATLAS to dCache
    • The network between the sites, and between each site and Geant, is understood and raises no concerns for the next several years against the rates expected from 2022 onwards. Bern is working on removing the current 40Gb/s limit at their border, and will also check with the university about a dedicated link to decouple from “student” activities. Routing all traffic to bypass the internal routing and the firewall (DMZ) should go a long way towards mitigating that possible interference (e.g. p2p traffic from other activities won’t be routed via the BE DMZ to CSCS or Geant). CSCS is close to having a second 100Gb/s link to Geneva / Geant

    (*) The notes omit unrelated political issues that were unneeded and outside the scope of the meeting, and include additional detail that could not be discussed as a result.

Topic attachments:
  • DPM-internal-layout.pdf (r1, 839.2 K, 2020-07-21, GianfrancoSciacca): DPM internal layout
  • Swiss-ATLAS-Federation-layout.pdf (r1, 235.0 K, 2020-07-20, GianfrancoSciacca): Swiss ATLAS Federation layout
Topic revision: r7 - 2020-08-12 - GianfrancoSciacca