Swiss Grid Operations Meeting on 2020-02-20 at 15:30

Followup from previous Action Items

Action items

VO reports


  • the pledges were achieved (surpassed) in the past month
  • while the pledged performance was achieved, there was a shortfall on the 40:40:20 share; see Nick's slides to understand why the 40:40:20 split was difficult to achieve last month
  • 15-20% of the 1-core EVGEN jobs time out without logs. Debugging with Miguel is not conclusive yet. It is impossible to save the session for all ATLAS jobs --> try to do it for single-core production jobs --> then debug
  • split the monitoring plot of "Slots of running jobs" into two plots, to avoid stacked plots. The total pledge is added to the plots automatically; add by hand a dotted line with the CSCS-only pledge
  • All dips in the plots for both CSCS and Bern discussed and understood
  • The color coding in the table "ATLAS T2-statistics" is red < 75% < yellow < 90% < green
  • The color coding in the table "ATLAS Hammercloud statistics" is red < 95% < green
  • Some details about the Swiss ATLAS federation will be sent in the next few days
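
The colour thresholds quoted above for the two ATLAS tables can be written down as a small sketch. The function names are hypothetical, and the handling of values exactly on a boundary (75%, 90%, 95%) is an assumption: the minutes do not say which band they fall into, so here they go to the higher band.

```python
def t2_colour(percent: float) -> str:
    """Colour band for the "ATLAS T2-statistics" table.

    Assumption: a value exactly on a threshold goes to the higher band.
    """
    if percent < 75:
        return "red"
    if percent < 90:
        return "yellow"
    return "green"


def hammercloud_colour(percent: float) -> str:
    """Colour band for the "ATLAS Hammercloud statistics" table.

    Assumption: exactly 95% counts as green.
    """
    if percent < 95:
        return "red"
    return "green"


if __name__ == "__main__":
    for value in (70.0, 80.0, 92.0):
        print(f"T2 {value}%: {t2_colour(value)}")
    for value in (94.0, 97.0):
        print(f"Hammercloud {value}%: {hammercloud_colour(value)}")
```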

    Clarification from ATLAS (added 11th March 2020):
  • It has been claimed ATLAS have introduced singularity (and other changes) without notice. The real facts are of course radically different, and this involved a lot of work both from the ATLAS and CSCS side. In the interest of avoiding further propagation of inaccuracies, the details of the switch to singularity follow below (documented in Slack and RT):
    • Aug: discussed with ATLAS how to best treat the CSCS case (non compliant OS and running shifter)
    • 20th Aug: heads up on atlas_mon
    • 21st Aug: details of settings needed and test instructions presented
    • 29th Aug: feedback from Miguel (previously on holiday), ticket #36869 opened
    • 30th Aug: tarball with ATLAS pilot code for testing on Daint made available
    • 2nd Sep: central ops for ATLAS contacted to arrange the switch together with CSCS (switch to singularity needs a change at CSCS and a simultaneous change in ATLAS).
    • Sept: switch date postponed to wait for recovery from a severe dcache incident involving data loss
    • 13th Sep: arranged with Miguel and ATLAS expert the exact time point to carry out the switch to singularity (during f2f meeting), syncing the two sides via Skype and Slack.
    • 13th Sep: Switch successful and validated on both sides
    • 18th Sep (5 days later): HC blacklisting and investigation, heads up by Gianfranco about possible extra stress on cvmfs due to containers sourced off it. Later on, heads up from ATLAS ops about cvmfs issues at CSCS. Miguel tried increasing in-RAM cache on some nodes
    • 19th Sep: downtime in GOCDB to change CVMFS configuration on all nodes.



  • a ticket has been filed to understand why CMS had so few pending jobs (the initial explanation is that there were no jobs to run in the whole CMS which sounds surprising)



  • LHCb all OK
  • 2 misconfigurations occurred on the LHCb side
  • affected by the CEPH downtime at CERN
  • IPv6 issues at CERN

T2 Sites reports


January utilization:
  • Pledges
    • 112.2% CHIPP overall
    • 104.1% ATLAS
    • 101.5% CMS
    • 149.9% LHCb
  • Sharing
    • ATLAS:CMS:LHCb [%] 37:36:27
  • discussion of the dried out queues
  • Whenever a VO is seen not to submit jobs (e.g. by watching the Grafana page), warn CSCS and try to debug the case as close to real time as possible
  • ATLAS has a procedure to drain its queue for scheduled downtimes. This explains why the number of jobs went down before the actual maintenance. After the maintenance the jobs were back in a few hours, as expected. While this is not necessarily a problem per se, the mechanism has an impact on the capability to reach the pledges.
  • We badly need to improve the communication flow between VOs and CSCS. Try with:
    • Hot topics page (see Hot topics page) to collect changes to the system that can possibly affect operations
    • Slack-chat in case special "real time" debugging is needed while the issue is in progress
    • tickets to flag problems
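
The January sharing figures above can be compared with the nominal 40:40:20 split with a short sketch. It assumes that the 40:40:20 target is listed in the same ATLAS:CMS:LHCb order as the reported 37:36:27; the variable and function names are illustrative only.

```python
# Nominal share (assumed order ATLAS:CMS:LHCb) and the reported January share, in percent.
TARGET_SHARE = {"ATLAS": 40, "CMS": 40, "LHCb": 20}
JANUARY_SHARE = {"ATLAS": 37, "CMS": 36, "LHCb": 27}


def share_deviation(achieved: dict, target: dict) -> dict:
    """Return achieved minus target, in percentage points, per VO."""
    return {vo: achieved[vo] - target[vo] for vo in target}


deviation = share_deviation(JANUARY_SHARE, TARGET_SHARE)

if __name__ == "__main__":
    for vo, delta in deviation.items():
        print(f"{vo}: {delta:+d} percentage points")
```

Under that assumption ATLAS and CMS were 3 and 4 percentage points below the nominal share, and LHCb was 7 points above it.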
Clarification from ATLAS (added 11th March 2020):
  • Adding plot of ATLAS pending jobs for January. There is always a healthy backlog of submitted jobs (contrary to the CSCS claim). The only gap visible in the plot corresponds to a 24h HC blacklisting. This was determined to have been caused by an unannounced change in the singularity version run at CSCS, causing every job to fail (investigation conducted on Friday 10th Jan by Miguel and Gianfranco)

  • ATLAS-pending-jobs-Jan2020.png


  • no particular discussion

T3 Sites reports


  • Continue migration of all T3 nodes to rhel7: Slurm clients are done
  • T3 downtime day for Storage Upgrade. Due to good preparation and testing the following upgrades were completed:
    • dCache servers OS from sl6 to rhel7
    • dCache from 3.2 to 5.2
    • Postgresql from 9.5 to 11
    • Firmware on Dalco storage pools and NetApp

  • There was a storage discussion at T3 regarding possible POSIX solutions to replace ageing hardware (40TB ZFS on Linux). Review summary concerning the usage of dCache as an NFS4.1 server: it still looks like bleeding-edge development and requires constant updates of dCache and its configuration.


  • Can now request grid certificates via the Swiss CA
  • DPM pools upgraded to the latest version in line with the Bern ones
  • ARC CE deployment delayed, will have a meeting next Monday to outline the final steps



  • NTR

Review of open tickets

  • https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO

    8 Tickets found
    Ticket-ID Type VO Site Priority Resp. Unit Status Last Update Subject Scope
    145582 cms CSCS-LCG2 urgent NGI_CH in progress 2020-02-19 T2_CH_CSCS is intermittently failing ... WLCG
    144898 cms CSCS-LCG2 less urgent NGI_CH in progress 2020-01-22 T2_CH_CSCS warning - outdated version ... WLCG
    144499 none T3_CH_PSI less urgent NGI_CH assigned 2020-02-18 Upgrade to recent dCache release EGI
    144485 none CSCS-LCG2 less urgent NGI_CH assigned 2020-02-04 Upgrade to recent dCache release EGI
    143464 none UNIBE-LHEP urgent NGI_CH in progress 2020-02-20 DPM at UNIBE-LHEP has to be configured ... EGI
    141276 none less urgent NGI_CH assigned on hold 2019-11-26 yearly review of the information ... EGI
    131965 none UNIBE-LHEP less urgent NGI_CH assigned on hold 2020-01-20 IPv6 deployment at WLCG Tier-2 sites EGI
    131432 none CSCS-LCG2 urgent NGI_CH assigned involved in progress 2020-01-27 Storage accounting deployment EGI


  • Summary of the discussion at CSCS on 10.02.2020 20200220_OpsMeeting.pdf
  • Next meeting date: 05.03.2020 --> ATLAS GPU challenge + discussion of the slides above


  • CSCS: Nick, Pablo, Dino, Dario, Gianni, Miguel, Matteo
  • CMS: Mauro, Christoph, Derek
  • ATLAS: Gianfranco
  • LHCb: Roland
  • EGI: Gianfranco

  • vo_report_20Feb20.pdf: cms_vo_report_feb20
