<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
Swiss Grid Operations Meeting on 2020-02-20 at 15:30
Follow-up from previous Action Items
Action items
VO reports
ATLAS
Minutes:
- The pledges were achieved (surpassed) in the past month
- While the pledged performance was achieved, there was a shortfall on the 40:40:20 share; see Nick's slides for why the 40:40:20 split was difficult to achieve last month
- 15-20% of the 1-core EVGEN jobs time out without logs. Debugging with Miguel is not conclusive yet. Saving the session for all ATLAS jobs is impossible --> try to do it for single-core production jobs only --> then debug
- Split the "Slots of running jobs" monitoring plot in two, just not to have the stacked plots; the total pledge is added to the plots automatically, so add a dotted line with the CSCS-only pledge by hand (see the plotting sketch after this list)
- All dips in the plots for both CSCS and Bern were discussed and understood
- The color coding in the "ATLAS T2-statistics" table is red < 75% < yellow < 90% < green (see the threshold sketch after this list)
- The color coding in the "ATLAS Hammercloud statistics" table is red < 95% < green
- Some details about the Swiss ATLAS federation will be sent in the next days
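To illustrate the plot change above: a minimal sketch of the two non-stacked panels with a dotted pledge line, assuming the per-site slot counts are available as time series. The numbers, series, and pledge values below are hypothetical placeholders, not the real monitoring data.
<verbatim>
# Minimal sketch of the "Slots of running jobs" change: two separate
# (non-stacked) plots, one per site, each with a dotted pledge line.
# All numbers below are hypothetical placeholders.
import matplotlib.pyplot as plt

hours = list(range(24))
cscs_slots = [3000 + 40 * h for h in hours]   # placeholder time series
bern_slots = [1500 + 20 * h for h in hours]   # placeholder time series
CSCS_PLEDGE = 3400                            # placeholder CSCS-only pledge
BERN_PLEDGE = 1800                            # placeholder Bern pledge

fig, (ax_cscs, ax_bern) = plt.subplots(2, 1, sharex=True)

ax_cscs.plot(hours, cscs_slots, label="running slots")
ax_cscs.axhline(CSCS_PLEDGE, linestyle=":", label="CSCS-only pledge")  # hand-added dotted line
ax_cscs.set_title("CSCS")
ax_cscs.legend()

ax_bern.plot(hours, bern_slots, label="running slots")
ax_bern.axhline(BERN_PLEDGE, linestyle=":", label="pledge")
ax_bern.set_title("Bern")
ax_bern.set_xlabel("hour")
ax_bern.legend()

plt.tight_layout()
plt.show()
</verbatim>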
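And the two color codings from the tables above as a small threshold sketch; how values falling exactly on a boundary (75%, 90%, 95%) are classified is an assumption, since the ordering alone does not say.
<verbatim>
# Color coding of the ATLAS statistics tables as reported above.
# Treatment of values exactly at a boundary is an assumption.
def t2_statistics_color(value_pct: float) -> str:
    """ATLAS T2-statistics: red < 75% < yellow < 90% < green."""
    if value_pct < 75:
        return "red"
    if value_pct < 90:
        return "yellow"
    return "green"

def hammercloud_color(value_pct: float) -> str:
    """ATLAS Hammercloud statistics: red < 95% < green."""
    return "red" if value_pct < 95 else "green"

print(t2_statistics_color(88.0))   # -> yellow
print(hammercloud_color(96.5))     # -> green
</verbatim>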
CMS
Minutes:
- A ticket has been filed to understand why CMS had so few pending jobs (the initial explanation, that there were no jobs to run in the whole of CMS, sounds surprising)
LHCb
Minutes:
- LHCb all OK
- 2 misconfigurations occurred on the LHCb side
- affected by the CEPH downtime at CERN
- IPv6 issues at CERN
T2 Sites reports
CSCS
January utilization:
- Pledges (consistency check sketched below)
  - 112.2% CHIPP overall
  - 104.1% ATLAS
  - 101.5% CMS
  - 149.9% LHCb
- Sharing
  - ATLAS:CMS:LHCb [%] 37:36:27
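A quick consistency check on the utilization figures above: the overall CHIPP number is reproduced by averaging the per-VO utilizations with the nominal 40:40:20 pledge shares mentioned in the ATLAS report. That this is how the overall figure is computed is an inference from the numbers, not something stated in the minutes.
<verbatim>
# Consistency check: the 112.2% CHIPP overall matches a weighting of the
# per-VO utilizations by the nominal 40:40:20 pledge shares.
# The weighting scheme itself is an assumption inferred from the numbers.
pledge_share = {"ATLAS": 0.40, "CMS": 0.40, "LHCb": 0.20}
utilization  = {"ATLAS": 104.1, "CMS": 101.5, "LHCb": 149.9}  # [% of pledge]

overall = sum(pledge_share[vo] * utilization[vo] for vo in pledge_share)
print(f"CHIPP overall: {overall:.1f}%")  # -> 112.2%
</verbatim>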
Minutes:
- Discussion of the dried-out queues
- Whenever a VO does not submit jobs (e.g. noticed by watching the Grafana page), warn CSCS and try to debug the case as much in real time as possible (a minimal watcher is sketched after this list)
- ATLAS has a procedure to drain its queue for scheduled downtimes. This explains why the number of jobs went down before the actual maintenance. After the maintenance the jobs were back in a few hours, as expected. While this is not necessarily a problem per se, the mechanism has an impact on the capability to reach the pledges.
- We badly need to improve the communication flow between the VOs and CSCS. Try with:
  - a Hot Topics page to collect changes to the system that can possibly affect operations
  - a Slack chat in case special "real time" debugging is needed while the issue is in progress
  - tickets to flag problems
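As a concrete starting point for the real-time warning item above, a minimal watcher sketch. The monitoring endpoint, its JSON layout, the VO keys, and the Slack webhook URL are all hypothetical placeholders, not an existing CSCS interface; only the Slack incoming-webhook call shape is standard.
<verbatim>
# Minimal sketch of a queue watcher: poll a monitoring endpoint for the
# number of running jobs per VO and warn on Slack when a queue dries out.
# The endpoint, its JSON layout, and the webhook URL are hypothetical.
import requests

MONITORING_URL = "https://monitoring.example.ch/api/jobs"  # placeholder
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"     # placeholder
VOS = ("atlas", "cms", "lhcb")

def check_queues() -> None:
    jobs = requests.get(MONITORING_URL, timeout=10).json()
    for vo in VOS:
        running = jobs.get(vo, {}).get("running", 0)
        if running == 0:  # queue dried out: warn CSCS and the VO channel
            requests.post(
                SLACK_WEBHOOK,
                json={"text": f"{vo.upper()} queue at CSCS has no running jobs"},
                timeout=10,
            )

if __name__ == "__main__":
    check_queues()  # e.g. run from cron every 15 minutes
</verbatim>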
UNIBE LHEP/Ubelix
Minutes:
T3 Sites reports
PSI
- Continued the migration of all T3 nodes to rhel7: the Slurm clients are done
- T3 downtime day for the storage upgrade. Thanks to good preparation and testing, the following upgrades were completed:
  - dCache servers OS from sl6 to rhel7
  - dCache from 3.2 to 5.2
  - Postgresql from 9.5 to 11
  - Firmware on the Dalco storage pools and the NetApp
- There was a storage discussion at T3 regarding POSIX-access solutions to replace ageing hardware (40 TB ZFS on Linux). My review summary on using dCache as an NFS 4.1 server: it still looks like bleeding-edge development and requires constant updates to dCache and its configuration (an illustrative config fragment follows this list).
- Can now request grid certificates via the Swiss CA
- DPM pools upgraded to the latest version in line with the Bern ones
- ARC CE deployment delayed, will have a meeting next Monday to outline the final steps
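To make the NFS 4.1 remark concrete: an illustrative dCache layout fragment of the kind evaluated, following the dCache Book's NFS door setup. The domain name, path, and export line are placeholders, and, per the "bleeding edge" verdict above, the exact properties tend to shift between dCache releases.
<verbatim>
# Illustrative fragment of /etc/dcache/layouts/<host>.conf
# (the domain name is a placeholder)
[nfsDomain]
[nfsDomain/nfs]
nfs.version = 4.1

# Illustrative entry in /etc/dcache/exports
# (path and client spec are placeholders)
/data *.psi.ch(rw,sec=sys)
</verbatim>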
EGI
News
Review of open tickets
a.o.b
- Summary of the discussion at CSCS on 10.02.2020: 20200220_OpsMeeting.pdf
- Next meeting date: 05.03.2020 --> ATLAS GPU challenge + discussion of the slides above
Attendants
- CSCS: Nick, Pablo, Dino, Dario, Gianni, Miguel, Matteo
- CMS: Mauro, Christoph, Derek
- ATLAS: Gianfranco
- LHCb: Roland
- EGI: Gianfranco