<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE =
TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME =
TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW =
TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
Swiss Grid Operations Meeting on 2019-03-07 at 14:00
Site status
CSCS
Systems
- Phoenix: all nodes idle, load moved to Daint
- ARC CEs reinstalled with new HW, only arc04 missing. Scheduled reinstallation of arc04 on Monday
- Testing new squid and scheduled reinstallation of cvmfs and cvmfs1
- Old compute nodes distribution?
- LHCb wants 2 chassis with 4 nodes each
- ATLAS will take around 100 nodes
Storage
dCache
- normal operation
- preparing lab for new design + upgrade
- complete storage migration will start in April
GPFS
- normal operation
- Upgraded firmware on DELL SC9000 (SSD tier1)
- next week: rolling / online network interface replacement (from IB to Ethernet)
- upgrade to GPFS 5.0.2
- Slow tier migration in April from DDN SFA12k to DELL SC9000
PSI
UNIBE-LHEP
- Accounting numbers (from scheduler) from last month, LHEP only
Swiss ATLAS statistics
- HammerCloud availability:
- ANALY_CSCS-HPC: 95%
- CSCS-LCG2-HPC_MCORE: 94.5%
- UNIBE-* : 100%
- Running slots
- A large number of stuck jobs on ARC skews the statistics for CSCS, causing reporting problems towards ATLAS
- Very likely due to the reported issues with the Daint scratch file system, which affect WLCG jobs in some way
- This required a laborious manual cleanup:
- job list provided by ATLAS, culled from the aCT
- manual cleanup of the ARC sessiondir carried out by Miguel
- Accounting Numbers from the ATLAS dashboard (February 2019) CSCS+UNIBE
- Take home lessons from the last month:
- Failed WC very high; we need more real-time alerts
- Public dashboard replica offline for a while
- ATLAS now relies fully on ARC services:
- we need ARC metrics and/or logs
- Dino has some ideas, but implementing them is not a short-term affair
- what can we do in the meantime?
- we need monitoring (Nagios, Graylog?) with automated checks on ARC services
- Please report general Daint issues that could affect WLCG jobs (SLACK-general or email) so that we can react if needed
- ...
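As a possible stopgap for the automated checks on ARC services discussed above, here is a minimal sketch of a Nagios-style probe. It only tests whether TCP ports accept connections and maps the result to the standard Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The port list, the thresholds and the function names are assumptions made for illustration and would need to be checked against the actual CE configuration.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMPTION: illustrative port list, not taken from the minutes.
# Verify against the real ARC CE setup before use.
ASSUMED_PORTS = {"gridftp": 2811, "ldap-infosys": 2135, "a-rex-https": 443}


def check_tcp(host, port, warn_s=1.0, timeout_s=5.0,
              connect=socket.create_connection):
    """Try a TCP connection and return (nagios_code, message).

    `connect` is injectable so the logic can be tested without a network.
    """
    start = time.monotonic()
    try:
        with connect((host, port), timeout=timeout_s):
            elapsed = time.monotonic() - start
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
    if elapsed > warn_s:
        return WARNING, f"WARNING - {host}:{port} connected in {elapsed:.2f}s"
    return OK, f"OK - {host}:{port} connected in {elapsed:.2f}s"


def run_checks(host, ports=ASSUMED_PORTS, **kwargs):
    """Probe every port and return (worst_code, list_of_messages)."""
    if not ports:
        return UNKNOWN, ["UNKNOWN - no ports configured"]
    results = [check_tcp(host, port, **kwargs) for port in ports.values()]
    worst = max(code for code, _ in results)
    return worst, [msg for _, msg in results]
```

A small wrapper that prints the messages and exits with the returned code could be called from Nagios or cron. Note that an open port only shows the service is listening; it would not have caught the stuck jobs themselves, so ARC logs or job-state metrics are still needed.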
UNIBE-ID
- End of Feb: Reinstallation of ARC CE on nordugrid.unibe.ch AKA UNIBE LHEP_UBELIX:
- Reason: EL6 -> EL7
- smooth transition within ~4h
- no issues after reinstallation
UNIGE
- Xxx
- Accounting numbers (from scheduler) from last month
NGI_CH
- Xxx
- NGI-CH Open Tickets review
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS:
- CMS:
- ATLAS:
- LHCb:
- EGI:
Action items