Swiss Grid Operations Meeting on 2021-07-08 at 14:00
Next meeting: 5 August 2021
Minutes:
CSCS reorganisation:
- new WLCG platform integration led by Miguel, replacing Nick at operations
- more details, charts etc. on the organisation will be posted by Pablo
ATLAS:
- Investigate the discrepancy in storage (1904 vs. 1896 vs. 2100 TB)
- Rucio dump, we have all details, we need to verify them with Dario
- maybe a problem with the token update? Derek seemed to confirm this
- DPM:
- add all disks to the new server, install the DPM stack, transfer the jobs to the new server, then drop the old one
- timescale for new disk installation by CSCS: roughly the next 4 to 8 weeks
- ATLAS DPM work can begin following this
- ATLAS would need 180-200 TB of new disk to be set aside for this change (similar to the current size).
LHCb:
- GGUS tickets need to be addressed faster (after 4 days the evidence of the problem is gone)
- some cores got lost (dip in LHCb monitoring); afterwards, a discrepancy between the number of jobs seen by LHCb (~2500) and by CSCS (~3000)
- to be checked; probably fixed over the weekend by an ARC restart
CMS:
- in waiting room / SAM tests failing
- need someone to check GGUS tickets once a day
- CMS dCache fix implemented with Elia
- CSCS needs to validate the solution
CSCS:
- all experiments exceeded the pledged resources
- DPM discussion --> see ATLAS
Pledges:
- discuss the final numbers with Pablo in the last week of July
- CHIPP agreed on “scenario 6” in Pablo's xls, meaning:
- one boost on disk
- compensated by CPU
- precise numbers being finalized
AFTER THE SUMMER BREAK
Brainstorm meeting for:
- T2 requirements to run on ALPS
- Possible move of the PSI-T3 to ALPS (users access management)
Followup from previous Action Items
Action items
ATLAS
CMS
LHCb
T2 Sites reports
CSCS
UNIBE
- steady operation with 2.2x pledge delivery
- lots of concurrent analysis jobs with thousands of files each caused high pressure on Lustre. Decided on a full restart.
T3 Sites reports
PSI
EGI / WLCG
- The old APEL Message Broker network was switched off on 8 July 2021. The new ARGO Messaging Service (AMS) is now used to publish the accounting data. All CEs in CH have been updated to a version that supports AMS: 6.12.0
- EGI ARGO Availability/Reliability report for NGI_CH for June 2021
- Issues with national funding to cover the EGI participation fee and the operations coordination role. A discussion took place; no solution has been identified yet.
Review of open tickets
8 of 8 tickets
| Ticket-ID | Type | VO | Site | Priority | Resp. Unit | Status | Last Update | Subject | Scope |
| 152847 | | atlas | CSCS-LCG2 | top priority | NGI_CH | assigned | 2021-07-02 | DE CSCS-LCG2: High number of job failures | WLCG |
| 152819 | | cms | CSCS-LCG2 | urgent | NGI_CH | assigned | 2021-07-07 | SAM tests for CE failing at T2_CH_CSCS | WLCG |
| 152624 | | lhcb | CSCS-LCG2 | urgent | NGI_CH | in progress | 2021-06-29 | Pilots Failed at CSCS-LCG2 | WLCG |
| 152070 | | cms | CSCS-LCG2 | urgent | NGI_CH involved | assigned | 2021-06-14 | SAM tests failing at T2_CH_CSCS | WLCG |
| 151997 | | cms | CSCS-LCG2 | urgent | NGI_CH | assigned | 2021-05-14 | WebDAV protocol deployed (T2_CH_CSCS) | WLCG |
| 151265 | | cms | CSCS-LCG2 | less urgent | NGI_CH | on hold | 2021-04-09 | Enabling WebDAV on Production ... | WLCG |
| 150373 | | dune | UNIBE-LHEP | less urgent | NGI_CH | in progress | 2021-07-08 | Enable DUNE queue for CPU and future ... | EGI |
| 144485 | | none | CSCS-LCG2 | less urgent | NGI_CH assigned | in progress | 2021-04-14 | Upgrade to recent dCache release | EGI |
- Attendees
- CSCS: Miguel, Nick, Pablo, Elia
- CMS: Derek, Mauro
- ATLAS: Steven
- LHCb: Roland
- EGI: