Swiss Grid Operations Meeting on 2022-05-12 at 09:00
Next meeting:
F2F, 2 June 2022
Minutes:
EGI:
still investigating with EnhanceR how to deal with the fee and the 10% FTE
ATLAS:
heavy under-delivery in April: ~35% of the pledge
share under 19%
storage at Bern decommissioned and integrated into NorduGrid (resources are still physically in Bern)
CMS:
Mont Fort added to the CMS workflows
no open tickets
overall situation looks OK
A number of file transfers are failing: an issue was identified in the way davs manages the VOMS attributes in third-party transfers. Only newly transferred files are added to the storage space; existing data cannot be added. This leads to micromanagement of the free space. Just today we may have received a solution for managing small reservations / deleting old data
CMS cache storage space usage: proposed a new space allocation to meet the 75% required to be managed by cps-central
Tickets are proactively taken care of by CSCS: thanks!
LHCb:
All fine on Daint. The usual 10% of failing pilots disappeared at the beginning of last month. Good, but unclear what happened...
Mont Fort: some trouble submitting with the standard Python script, but they figured out that there was a typo.
CSCS:
Daint: pretty bad month… confirming the numbers found by ATLAS (not so bad for the others)
Why is ATLAS so bad? Unclear, but related to the reco jobs.
Mont Fort:
April: 10 nodes with 128 cores, 512 GB RAM + 4 NVIDIA A100 + 10kHS06
May: 32 nodes with 256 cores, 512 GB RAM + 90kHS06
Share: 66% of the resources to ATLAS to catch up + 15% / 15% CMS/LHCb
Temporarily adding a cluster on Mont Gele
same config as Mont Fort but with HDD-based storage
Additional +90kHS06
ALPS downtime in July to replace some network components. Suggestion: keep Daint for a while with reduced capacity until the July migration. Is that OK?
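As a rough cross-check of the Mont Fort share quoted above, the split can be turned into approximate kHS06 per VO. This is only a sketch: it assumes the 66% / 15% / 15% share applies to the May figure of 90kHS06 (the minutes do not say whether Mont Gele is included), and the names `CAPACITY_KHS06`, `SHARES` and `split_capacity` are illustrative.

```python
# Approximate split of the May Mont Fort capacity by VO.
# Assumption: the share applies to the 90 kHS06 quoted for May only.
CAPACITY_KHS06 = 90.0
SHARES = {"ATLAS": 0.66, "CMS": 0.15, "LHCb": 0.15}


def split_capacity(capacity_khs06, shares):
    """Return the capacity assigned to each VO, in kHS06."""
    return {vo: capacity_khs06 * fraction for vo, fraction in shares.items()}


allocation = split_capacity(CAPACITY_KHS06, SHARES)
# The quoted shares add up to 96%, so a small remainder is left unassigned.
unallocated = CAPACITY_KHS06 - sum(allocation.values())

for vo, khs06 in allocation.items():
    print(f"{vo}: {khs06:.1f} kHS06")
print(f"unallocated: {unallocated:.1f} kHS06")
```

Under these assumptions ATLAS gets roughly 59 kHS06 and CMS and LHCb about 13.5 kHS06 each; the remaining 4% is not assigned in the minutes.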
Action items and points for discussion:
increase resources for ATLAS on Mont Fort
Daint: around 20% of the resources (it won’t fix the reco jobs)
Summing the two should help
Must add Slurm and ARC monitoring on Mont Fort and Mont Gele asap
If ATLAS sees reco jobs not running, it doesn’t send anything else because it assumes that the site is full
ATLAS
CMS
LHCb
T2 Sites reports
CSCS
UNIBE
T3 Sites reports
PSI
EGI / WLCG
Review of open tickets
https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
4 of 4 Tickets
Ticket-ID | Type | VO | Site | Priority | Resp. Unit | Status | Last Update | Subject | Scope
156799 | | lhcb | CSCS-LCG2 | very urgent | NGI_CH | in progress | 2022-05-04 | Pilots Failed at CSCS-LCG2 | EGI
156213 | | atlas | CSCS-LCG2 | less urgent | NGI_CH | in progress | 2022-03-07 | CSCS-LCG2: Non-operational storage ... | EGI
154102 | | dune | UNIBE-LHEP | less urgent | NGI_CH | on hold | 2021-12-22 | Local accounting for DUNE jobs at ... | EGI
150373 | | dune | UNIBE-LHEP | less urgent | NGI_CH | on hold | 2021-12-22 | Enable DUNE queue for CPU and future ... | EGI
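The ticket search link above is an ordinary query string, so the same search can be rebuilt for another support unit or status. The sketch below copies the parameter names and values from that link; the helper name `ticket_search_url` is hypothetical, and how GGUS handles values other than those in the link has not been checked.

```python
from urllib.parse import urlencode

GGUS_BASE = "https://ggus.eu/index.php"


def ticket_search_url(support_unit, status="open"):
    """Build a GGUS ticket search URL using the parameters seen in the link above."""
    params = {
        "mode": "ticket_search",
        "supportunit": support_unit,
        "status": status,
        "timeframe": "any",
        "orderticketsby": "REQUEST_ID",
        "orderhow": "desc",
        "search_submit": "GO",
    }
    return f"{GGUS_BASE}?{urlencode(params)}"


url = ticket_search_url("NGI_CH")
print(url)
```

Called with `"NGI_CH"`, this reproduces the link used for the review of open tickets.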
- Attendants
- CSCS:
- CMS:
- ATLAS:
- LHCb:
- EGI: