Swiss Grid Operations Meeting on 2019-03-07 at 14:00

Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
External link: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
Phone gate: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
Switch Vidyo SIP IP: 137.138.248.204

- Site status
  - CSCS
  - PSI
  - UNIBE-LHEP
  - Swiss ATLAS statistics
  - UNIBE-ID
  - UNIGE
  - NGI_CH
- Other topics
- A.O.B.
- Attendants
- Action items

Site status

CSCS

Systems

Phoenix: all nodes idle, load moved to Daint
ARC CEs reinstalled on new hardware; only arc04 is still missing, with its reinstallation scheduled for Monday
Testing new squid; reinstallation of cvmfs and cvmfs1 is scheduled
Distribution of old compute nodes?
- LHCb wants 2 chassis with 4 nodes each
- ATLAS will take around 100 nodes

Storage
dCache

normal operation
preparing lab for new design + upgrade
complete storage migration will start in April

GPFS

normal operation
Upgraded firmware on DELL SC9000 (SSD tier1)
next week: rolling / online network interface replacement (from IB to Ethernet)
upgrade to GPFS 5.0.2
Slow tier migration in April from DDN SFA12k to DELL SC9000

PSI

UNIBE-LHEP

Ramping down LHEP in view of the cluster re-deployment
Monthly summary: pledged 18 kHS06, delivered 18 kHS06
Ubelix contributing >50% (23% typical)
Running an average of >1850 slots (2500 typical)
6-month history UniBE (pledge: 18 kHS06)

Accounting numbers (from scheduler) from last month, LHEP only
- Omitted this month

Swiss ATLAS statistics

Hammercloud availability:
- ANALY_CSCS-HPC: 95%
- CSCS-LCG2-HPC_MCORE: 94.5%
- UNIBE-* : 100%
Running slots
- A large number of stuck jobs on ARC skews the statistics for CSCS, causing reporting problems to ATLAS
- Very likely due to the reported issues with the Daint scratch file system, which affect WLCG jobs in some way
- This required a laborious manual cleanup:
  - job list provided by ATLAS, culled from the aCT
  - manual cleanup of the ARC sessiondir carried out by Miguel
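
The matching step of the cleanup above could be scripted as a dry run. This is only a minimal sketch: it assumes the ARC sessiondir holds one subdirectory per job, named after the job ID, and that the ATLAS job list is available as plain job IDs, one per entry. The function name and arguments are hypothetical, and nothing is deleted.

```python
from pathlib import Path


def find_stuck_sessiondirs(sessiondir_root, job_ids):
    """Return session directories under sessiondir_root named after a listed job ID.

    Assumption: one subdirectory per job, named by its job ID.
    This only lists candidates; removal is left to the operator.
    """
    # Normalise the job list: strip whitespace/newlines, drop empty entries.
    wanted = {job_id.strip() for job_id in job_ids if job_id.strip()}
    root = Path(sessiondir_root)
    # Keep only directories whose name appears in the job list.
    return sorted(p for p in root.iterdir() if p.is_dir() and p.name in wanted)
```

The job list can be passed as an open file handle (one ID per line), and the returned paths reviewed by hand before anything is removed.
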
Accounting Numbers from the ATLAS dashboard (February 2019) CSCS+UNIBE

Cluster	Job Type	Produced WC core-hours; share of total	Good vs Bad WC (fraction)	CPU eff good jobs (fraction)
CSCS	Any	3'088'664; 72%	0.52	0.70
UniBe	Any	1'149'338; 28%	0.75	0.75
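
As a cross-check, the split between the two clusters can be recomputed from the core-hours in the table. A minimal sketch, assuming the two rows make up the whole total; the result (about 72.9% and 27.1%) is close to the 72% / 28% quoted, which may have been rounded differently or taken from a slightly different total.

```python
# Produced wall-clock core-hours per cluster, copied from the table above.
core_hours = {"CSCS": 3_088_664, "UniBe": 1_149_338}


def shares(values):
    """Return each entry's percentage share of the summed total."""
    total = sum(values.values())
    return {name: 100.0 * v / total for name, v in values.items()}


result = shares(core_hours)
for cluster, pct in result.items():
    print(f"{cluster}: {pct:.1f}%")
```
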

Take home lessons from the last month:
- Failed WC very high; we need more real-time alerts
- Public dashboard replica offline for a while
- ATLAS now relies fully on ARC services:
  - we need ARC metrics and/or logs
  - some ideas by Dino, but implementing them is not a short-term affair
  - what can we do in the meantime?
  - we need monitoring and automated checks on ARC services (Nagios, Graylog?)
  - Please report general Daint issues that could affect WLCG jobs (SLACK-general or email) so that we can react if needed
- ...
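
One possible interim step for the failed-WC alert is a small Nagios-style check. The sketch below is only an illustration: it uses the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), but the function name and the thresholds are hypothetical, and how the good and bad wall-clock hours are obtained (ARC logs, ATLAS dashboard) is left open.

```python
# Nagios plugin exit-code convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_failed_wc(good_hours, bad_hours, warn=0.30, crit=0.45):
    """Classify the failed wall-clock fraction.

    warn and crit are hypothetical thresholds, to be tuned per site.
    Returns (exit_code, status_message).
    """
    total = good_hours + bad_hours
    if total <= 0:
        # No data at all is reported as UNKNOWN rather than OK.
        return UNKNOWN, "UNKNOWN - no wall-clock data"
    failed = bad_hours / total
    if failed >= crit:
        status, label = CRITICAL, "CRITICAL"
    elif failed >= warn:
        status, label = WARNING, "WARNING"
    else:
        status, label = OK, "OK"
    return status, f"{label} - failed WC fraction {failed:.2f}"
```

A wrapper script would print the message and exit with the returned code, so that Nagios (or any scheduler honouring the same convention) can raise the alert.
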

UNIBE-ID

End of Feb: Reinstallation of ARC CE on nordugrid.unibe.ch AKA UNIBE LHEP_UBELIX:
- Reason: EL6 -> EL7
- smooth transition within ~4h
- no issues after reinstallation