<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
Swiss Grid Operations Meeting on 2020-02-20 at 15:30
Follow-up from previous Action Items
Action items
VO reports
ATLAS
Minutes:
- The pledges were achieved (surpassed) in the past month
- While the pledged performance was achieved, there was a shortfall on the 40:40:20 share; see Nick's slides for why the 40:40:20 split was difficult to achieve last month
- 15-20% of the 1-core EVGEN jobs time out without logs. Debugging with Miguel is not conclusive yet. Saving the session for all ATLAS jobs is impossible --> try to do it for single-core production jobs only --> then debug
- Split the "Slots of running jobs" monitoring plot in two, just not to have the stacked plots; the total pledge is added to the plots automatically, so add a dotted line with the CSCS-only pledge by hand (see the plotting sketch after this list)
- All dips in the plots for both CSCS and Bern were discussed and understood
- The color coding in the "ATLAS T2-statistics" table is red < 75% < yellow < 90% < green (see the threshold sketch after this list)
- The color coding in the "ATLAS Hammercloud statistics" table is red < 95% < green
- Some details about the Swiss ATLAS federation will be sent in the next days
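To illustrate the plot change above: a minimal sketch of the two non-stacked panels with a dotted pledge line, assuming the per-site slot counts are available as time series. The numbers, series, and pledge values below are hypothetical placeholders, not the real monitoring data.
<verbatim>
# Minimal sketch of the "Slots of running jobs" change: two separate
# (non-stacked) plots, one per site, each with a dotted pledge line.
# All numbers below are hypothetical placeholders.
import matplotlib.pyplot as plt

hours = list(range(24))
cscs_slots = [3000 + 40 * h for h in hours]   # placeholder time series
bern_slots = [1500 + 20 * h for h in hours]   # placeholder time series
CSCS_PLEDGE = 3400                            # placeholder CSCS-only pledge
BERN_PLEDGE = 1800                            # placeholder Bern pledge

fig, (ax_cscs, ax_bern) = plt.subplots(2, 1, sharex=True)

ax_cscs.plot(hours, cscs_slots, label="running slots")
ax_cscs.axhline(CSCS_PLEDGE, linestyle=":", label="CSCS-only pledge")  # hand-added dotted line
ax_cscs.set_title("CSCS")
ax_cscs.legend()

ax_bern.plot(hours, bern_slots, label="running slots")
ax_bern.axhline(BERN_PLEDGE, linestyle=":", label="pledge")
ax_bern.set_title("Bern")
ax_bern.set_xlabel("hour")
ax_bern.legend()

plt.tight_layout()
plt.show()
</verbatim>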
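And the two color codings from the tables above as a small threshold sketch; how values falling exactly on a boundary (75%, 90%, 95%) are classified is an assumption, since the ordering alone does not say.
<verbatim>
# Color coding of the ATLAS statistics tables as reported above.
# Treatment of values exactly at a boundary is an assumption.
def t2_statistics_color(value_pct: float) -> str:
    """ATLAS T2-statistics: red < 75% < yellow < 90% < green."""
    if value_pct < 75:
        return "red"
    if value_pct < 90:
        return "yellow"
    return "green"

def hammercloud_color(value_pct: float) -> str:
    """ATLAS Hammercloud statistics: red < 95% < green."""
    return "red" if value_pct < 95 else "green"

print(t2_statistics_color(88.0))   # -> yellow
print(hammercloud_color(96.5))     # -> green
</verbatim>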
CMS
Minutes:
- A ticket has been filed to understand why CMS had so few pending jobs (the initial explanation, that there were no jobs to run in the whole of CMS, sounds surprising)
LHCb
Minutes:
- LHCb all OK
- 2 misconfigurations occurred on the LHCb side
- affected by the CEPH downtime at CERN
- IPv6 issues at CERN
T2 Sites reports
CSCS
January utilization:
- Pledges (consistency check sketched below)
  - 112.2% CHIPP overall
  - 104.1% ATLAS
  - 101.5% CMS
  - 149.9% LHCb
- Sharing
  - ATLAS:CMS:LHCb [%] 37:36:27
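A quick consistency check on the utilization figures above: the overall CHIPP number is reproduced by averaging the per-VO utilizations with the nominal 40:40:20 pledge shares mentioned in the ATLAS report. That this is how the overall figure is computed is an inference from the numbers, not something stated in the minutes.
<verbatim>
# Consistency check: the 112.2% CHIPP overall matches a weighting of the
# per-VO utilizations by the nominal 40:40:20 pledge shares.
# The weighting scheme itself is an assumption inferred from the numbers.
pledge_share = {"ATLAS": 0.40, "CMS": 0.40, "LHCb": 0.20}
utilization  = {"ATLAS": 104.1, "CMS": 101.5, "LHCb": 149.9}  # [% of pledge]

overall = sum(pledge_share[vo] * utilization[vo] for vo in pledge_share)
print(f"CHIPP overall: {overall:.1f}%")  # -> 112.2%
</verbatim>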
Minutes:
- Discussion of the dried-out queues
- Whenever a VO does not submit jobs (e.g. noticed by watching the Grafana page), warn CSCS and try to debug the case as much in real time as possible (a minimal watcher is sketched after this list)
- ATLAS has a procedure to drain its queue for scheduled downtimes. This explains why the number of jobs went down before the actual maintenance. After the maintenance the jobs were back in a few hours, as expected. While this is not necessarily a problem per se, the mechanism has an impact on the capability to reach the pledges.
- We badly need to improve the communication flow between the VOs and CSCS. Try with:
  - a Hot Topics page to collect changes to the system that can possibly affect operations
  - a Slack chat in case special "real time" debugging is needed while the issue is in progress
  - tickets to flag problems
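As a concrete starting point for the real-time warning item above, a minimal watcher sketch. The monitoring endpoint, its JSON layout, the VO keys, and the Slack webhook URL are all hypothetical placeholders, not an existing CSCS interface; only the Slack incoming-webhook call shape is standard.
<verbatim>
# Minimal sketch of a queue watcher: poll a monitoring endpoint for the
# number of running jobs per VO and warn on Slack when a queue dries out.
# The endpoint, its JSON layout, and the webhook URL are hypothetical.
import requests

MONITORING_URL = "https://monitoring.example.ch/api/jobs"  # placeholder
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"     # placeholder
VOS = ("atlas", "cms", "lhcb")

def check_queues() -> None:
    jobs = requests.get(MONITORING_URL, timeout=10).json()
    for vo in VOS:
        running = jobs.get(vo, {}).get("running", 0)
        if running == 0:  # queue dried out: warn CSCS and the VO channel
            requests.post(
                SLACK_WEBHOOK,
                json={"text": f"{vo.upper()} queue at CSCS has no running jobs"},
                timeout=10,
            )

if __name__ == "__main__":
    check_queues()  # e.g. run from cron every 15 minutes
</verbatim>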
UNIBE LHEP/Ubelix
Minutes:
T3 Sites reports
PSI
- Continued the migration of all T3 nodes to rhel7: the Slurm clients are done
- T3 downtime day for the storage upgrade. Thanks to good preparation and testing, the following upgrades were completed:
  - dCache servers OS from sl6 to rhel7
  - dCache from 3.2 to 5.2
  - Postgresql from 9.5 to 11
  - Firmware on the Dalco storage pools and the NetApp
- There was a storage discussion at T3 regarding POSIX-access solutions to replace ageing hardware (40 TB ZFS on Linux). My review summary on using dCache as an NFS 4.1 server: it still looks like bleeding-edge development and requires constant updates to dCache and its configuration (an illustrative config fragment follows this list).
- Can now request grid certificates via the Swiss CA
- DPM pools upgraded to the latest version in line with the Bern ones
- ARC CE deployment delayed, will have a meeting next Monday to outline the final steps
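To make the NFS 4.1 remark concrete: an illustrative dCache layout fragment of the kind evaluated, following the dCache Book's NFS door setup. The domain name, path, and export line are placeholders, and, per the "bleeding edge" verdict above, the exact properties tend to shift between dCache releases.
<verbatim>
# Illustrative fragment of /etc/dcache/layouts/<host>.conf
# (the domain name is a placeholder)
[nfsDomain]
[nfsDomain/nfs]
nfs.version = 4.1

# Illustrative entry in /etc/dcache/exports
# (path and client spec are placeholders)
/data *.psi.ch(rw,sec=sys)
</verbatim>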
EGI
News
Review of open tickets
a.o.b
- Summary of the discussion at CSCS on 10.02.2020: 20200220_OpsMeeting.pdf
- Next meeting date: 05.03.2020 --> ATLAS GPU challenge + discussion of the slides above
Attendants
- CSCS: Nick, Pablo, Dino, Dario, Gianni, Miguel, Matteo
- CMS: Mauro, Christoph, Derek
- ATLAS: Gianfranco
- LHCb: Roland
- EGI: Gianfranco