Swiss Grid Operations Meeting on 2015-12-10

Date and time: 14:00
Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
External link: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
Phone gate: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

Site status

Storage
- dCache: stable, but we still have to run the cleaner manually. The upgrade to 2.10 will be performed on Wed 13th Jan 2016
- ATLAS: working on the monthly dumps
- GPFS (scratch): nothing to report
- New hardware: 4 servers for dCache and ~1 PB of storage. Working on moving the GPFS metadata disks to flash-based storage.
Compute
- Added some check functions to nodehealthcheck:
  - SWAP cleaner
  - automatically resolve some blackhole scenarios, e.g. by remounting a file system
  - after 60 days plus a random number of days, the node is put in drain for cleaning and a reboot
- Started some tests with the new Slurm version, to migrate sltop.
- Today we will order 40 new compute nodes with E5-2680v4 CPUs
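The "60 + random number of days" drain rule above can be sketched as follows. This is only an illustration, not the actual nodehealthcheck code: the function names, the 14-day maximum offset, and the idea of deriving the offset from the hostname (so that it stays stable between runs of the check) are all assumptions.

```python
import random
import zlib

BASE_DAYS = 60        # from the minutes: drain after 60 + random number of days
MAX_EXTRA_DAYS = 14   # assumption: the range of the random offset is not stated


def drain_threshold_days(hostname: str) -> int:
    """Return the uptime (in days) after which a node should be drained.

    The random offset is seeded from the hostname (hypothetical choice), so
    each node gets a different but reproducible threshold and the nodes do
    not all reboot on the same day.
    """
    rng = random.Random(zlib.crc32(hostname.encode()))
    return BASE_DAYS + rng.randint(0, MAX_EXTRA_DAYS)


def should_drain(hostname: str, uptime_days: float) -> bool:
    """True if the node has been up at least as long as its threshold."""
    return uptime_days >= drain_threshold_days(hostname)
```

The health check would call `should_drain()` with the current uptime and, if it returns true, ask the batch system to drain the node; that step is left out here.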

Operations
- ce01 cluster re-installation virtually completed (about 900 worker cores running, 120 still to be installed, 256 awaiting delivery)
- Started with a simple slurm setup (slurm-15.08.1) in order to cut down on commissioning time: one partition with
```
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=no
```
- We don't over-subscribe memory anymore: nodes don't starve and crash
- Memory usage is properly accounted for in 15.08 (PSS): no jobs killed on (artificial) over-limit of "vmem" (now the full address space reserved by a process, not what's allocated or used)
- Comparing job fail rates between ce01 and ce02 (still on old SGE) has convinced me to rush the re-installation of ce02 (started earlier today)
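To illustrate the difference between "vmem" (address space reserved) and PSS (memory actually used, with shared pages split among the processes sharing them), here is a minimal sketch that sums the relevant fields of a Linux `/proc/<pid>/smaps` listing. The function name and the sample data are made up for the example; this is not how Slurm itself does the accounting.

```python
def summarize_smaps(smaps_text: str) -> dict:
    """Sum the Size, Rss and Pss fields (in kB) over all mappings."""
    totals = {"Size": 0, "Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        key, sep, rest = line.partition(":")
        # Mapping header lines and other fields do not match these keys
        if sep and key in totals:
            totals[key] += int(rest.split()[0])
    return totals


# Made-up sample: a small code mapping plus a 1 GiB region that is
# reserved but barely touched.
SAMPLE = """\
00400000-0040c000 r-xp 00000000 08:01 1234 /bin/cat
Size:                 48 kB
Rss:                  40 kB
Pss:                  20 kB
7f0000000000-7f0040000000 rw-p 00000000 00:00 0
Size:            1048576 kB
Rss:                   8 kB
Pss:                   8 kB
"""

totals = summarize_smaps(SAMPLE)
print(totals)
```

In this sample the reserved address space is about 1 GiB while the PSS is only 28 kB, which is why a limit on "vmem" can kill jobs that use very little real memory.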
ATLAS specific operations
- Stable workflows by ATLAS (very large improvement since the beginning of Run II)
- Stuck with the implementation of monthly dumps of the namespace on the DPM SE:
  - head node on SLC5: the dump script does not work, and generating a valid proxy is also problematic
  - decided to push the re-deployment of the head node on SLC6
  - legacy config tool (YAIM) no longer supported
  - puppet-based configuration: got the right docs at the DPM workshop earlier this week at CERN
  - tests ongoing on a pps VM
  - also complicated by the fact that my site-bdii is still co-located with the DPM head node
  - this will likely be the first task for 2016

Operations
- atlasfs29.unige.ch : New certificate
- Another file server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
- We have new hardware to be installed in the cluster: file servers and a couple of PCs for services
- We will install puppet for DPM, and probably also for cluster configuration and setup. The plan is to use a testbed with atlasfs29 plus one of the two service PCs mentioned above
Network - Outlook
- We intend to get a new 10 Gb/s network switch, but this is still under negotiation
- Most likely, it will arrive at the beginning of next year
Storage
- There was a DPM SE workshop at CERN on December 7th-8th: https://indico.cern.ch/event/432642/
- Checking the data stored at the DPM SE for cleaning purposes, since ATLAS requested it
- Checking data in order to identify files which are registered in the catalogue (Rucio) but not physically present at the DPM SE, and vice versa
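The catalogue vs. storage comparison in the last item boils down to two set differences. A minimal sketch, assuming both dumps have already been reduced to plain lists of file paths in the same naming convention (the real Rucio and DPM dump formats are not described here, and the function name and example paths are hypothetical):

```python
def compare_dumps(catalogue_paths, storage_paths):
    """Compare a catalogue dump with a storage namespace dump.

    Returns the files registered in the catalogue but missing on storage
    ("lost") and the files on storage that the catalogue does not know
    about ("dark").
    """
    catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    storage = {p.strip() for p in storage_paths if p.strip()}
    return {
        "lost": sorted(catalogue - storage),
        "dark": sorted(storage - catalogue),
    }


# Made-up example paths
result = compare_dumps(
    ["/atlas/data/a.root", "/atlas/data/b.root"],
    ["/atlas/data/b.root", "/atlas/data/c.root", ""],
)
print(result)
```

Since files may be written or deleted between the two dumps, entries in either list would need to be re-checked before any cleanup.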