Swiss Grid Operations Meeting on 2021-05-20 at 14:00
Next meeting: 17.06 @ 14h00 <<-- as usual, the date can be changed to maximize attendance (in particular that of the VO reps)
Minutes:
ATLAS report
Nick: The numbers presented are computed considering 6910 expected slots, while the number should read 5963. Can ATLAS explain where 6910 is coming from? ATLAS pledge = 74240 HS06. Cores = Pledge / HS06-per-core = 74240 / 12.45 = 5963 cores. April kHS06 pledge hours = Pledge * HoursInDay * DaysInMonth / 1000 = 74240 * 24 * 30 / 1000 = 53452.8. Per CRIC, Generated = 57757.977. This means the pledge for April was exceeded.
Mauro: the numbers reported by Gianfranco and by Nick are constantly off from each other. We need to converge once and for all on a common source and stick to it, to avoid wasting everybody's time and energy trying to match them.
Gianfranco: Simple and not obscure MATH, I am surprised questions like this arise and the chair lets them arise and bugs me about them (but I have seen even worse): CRIC pledge / HS06 coefficient => Number of cores. Simple MATH. Please NOTE: no private pledge numbers have any role in ATLAS/WLCG. Do your own private scaling among yourselves please, my time is as much wasted as it is yours, or even more to be dealing with such petty issues.
CSCS is at 95% (with the usual overestimation error folded in) of pledge for April 2021. That is not tragic, but is NOT above 100%. In order to have numbers match, it is sufficient not to have private versions of the relevant metrics.
CSCS report
New monitoring up from mid-April (spikes are coming from http timeouts - fixing it)
All VOs are above 100%: we are using 10 extra nodes to cope with possible downtimes. Extrapolating from the load so far, May should still be above 100% in spite of the problems that occurred when coming out of the maintenance period.
ATLAS DPM migration
From the answer in the minutes of the 11.03.2021 meeting:
- What is the plan and timeline to move from test phase to production? (Gianfranco: will follow WLCG/ATLAS additional recommendations. More solutions need to be evaluated)
- How much of the ATLAS workload is using DPM at CSCS? (Gianfranco: the DPM capacity at CSCS is 11% of the ATLAS storage for UNIBE-LHEP)
This still doesn't answer how much of the workload is using it / whether anybody is using it at all.
Gianfranco: ATLAS is using it. It has been reported MULTIPLE times. If there is no understanding of how experiments use storage at sites, you could set up a workshop for that. Then you could also perhaps report "how much of the workload is using" the ATLAS storage at CSCS.
dCache HW has started to arrive -> turn OFF access to test-DPM on Jun 2nd
Mauro: I would like to understand what happens on the ATLAS workflows when the test-HW is removed
Gianfranco: Please note: CSCS have insisted for years that all communication about R&D projects must occur via CHIPP. This is no exception. As such, I have written to the CHIPP chair asking for an official and recorded communication. Should CSCS want to end the ongoing production project despite its success: send an official communication to ATLAS CH (e.g. me), including a brief motivation so that we can pass it upstream. Following a handshake, we will arrange for the ATLAS data migration away from CSCS. This must be scheduled.
In addition, NOTE: arbitrarily removing access to data will "have consequences". Outside of the private version of WLCG being showcased here with such random shoutouts (which has no precedent in the history of the LHC experiments), ATLAS data belong to ATLAS. And ATLAS service providers are bound to adhere to the rules, code of conduct, and MoU of the official WLCG, not to an arbitrary and private CHIPP version of it.
CMS report
nothing major, now waiting for the system to come back
GPU work is scheduled to begin next week
LHCb report
nothing major, waiting for the system to come back
Followup from previous Action Items
Action items
ATLAS
CMS
LHCb
T2 Sites reports
CSCS
T3 Sites reports
PSI
EGI / WLCG
Review of open tickets
https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
Ticket-ID | Type | VO   | Site       | Priority    | Resp. Unit | Status      | Last Update | Subject                                     | Scope
152076    |      | atlas | UNIBE-LHEP | less urgent | NGI_CH     | assigned    | 2021-05-20  | Job failures at UNIBE-LHEP                  | WLCG
152070    |      | cms   | CSCS-LCG2  | urgent      | NGI_CH     | assigned    | 2021-05-20  | SAM tests failing at T2_CH_CSCS             | WLCG
151997    |      | cms   | CSCS-LCG2  | urgent      | NGI_CH     | assigned    | 2021-05-14  | WebDAV protocol deployed (T2_CH_CSCS)       | WLCG
152033    |      | cms   | CSCS-LCG2  | urgent      | NGI_CH     | in progress | 2021-05-20  | Erroneous consistency check endpoint at ... | WLCG
150373    |      | dune  | UNIBE-LHEP | less urgent | NGI_CH     | in progress | 2021-05-14  | Enable DUNE queue for CPU and future ...    | EGI
144485    |      | none  | CSCS-LCG2  | less urgent | NGI_CH     | in progress | 2021-04-14  | Upgrade to recent dCache release            | EGI
151265    |      | cms   | CSCS-LCG2  | less urgent | NGI_CH     | on hold     | 2021-04-09  | Enabling WebDAV on Production ...           | WLCG
A.o.b.
- Attendants
- CSCS: Colin, Dario, Nick, Pablo
- CMS: Derek, Mauro
- ATLAS:
- LHCb: Roland
- EGI: