We don't over-subscribe memory anymore: nodes don't starve and crash
Memory usage is properly accounted for in 15.08 (PSS): no jobs are killed for an (artificial) over-limit on "vmem" (now the full address space reserved by a process, not what is allocated or used)
Comparing job fail rates between ce01 and ce02 (still on old SGE) has convinced me to rush the re-installation of ce02 (started earlier today)
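To illustrate why a "vmem" limit can kill jobs that use little real memory, here is a minimal sketch comparing the reserved address space with RSS and PSS. It assumes the Linux `/proc/<pid>/smaps` text format. The function names and the sample data are hypothetical, and this is not a description of how the batch system itself does its accounting.

```python
import re

# Hypothetical excerpt in /proc/<pid>/smaps format: two mappings,
# one file-backed and shared, one large anonymous and mostly untouched.
SAMPLE_SMAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
Size:                328 kB
Rss:                 200 kB
Pss:                 100 kB
7f0000000000-7f0000100000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                  64 kB
Pss:                  64 kB
"""


def sum_smaps_field(smaps_text, field):
    """Sum one per-mapping field (in kB) over all mappings."""
    pattern = re.compile(
        rf"^{re.escape(field)}:[ \t]+(\d+)[ \t]+kB[ \t]*$", re.MULTILINE
    )
    return sum(int(m.group(1)) for m in pattern.finditer(smaps_text))


def memory_summary(smaps_text):
    """Return reserved address space, RSS and PSS totals in kB."""
    return {
        "vmem_kb": sum_smaps_field(smaps_text, "Size"),  # address space reserved
        "rss_kb": sum_smaps_field(smaps_text, "Rss"),    # resident, shared pages counted in full
        "pss_kb": sum_smaps_field(smaps_text, "Pss"),    # resident, shared pages split among users
    }


if __name__ == "__main__":
    print(memory_summary(SAMPLE_SMAPS))
```

In the sample, the reserved address space is several times larger than the PSS total, which is the kind of gap that makes a limit on "vmem" artificial.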
ATLAS specific operations
Stable workflows by ATLAS (very large improvement since the beginning of Run II)
Stuck with the implementation of monthly dumps of the namespace on the DPM SE:
head node on SLC5: the dump script does not work, and generating a valid proxy is also problematic
decided to push the re-deployment of the head node on SLC6
legacy config tool (YAIM) no longer supported
Puppet-based configuration; got the right docs at the DPM workshop at CERN earlier this week
tests ongoing on a pps VM
also complicated by the fact that my site-bdii is still co-located with the DPM head node
this will likely be the first task for 2016
UNIBE-ID
Xxx
UNIGE
Operations
atlasfs29.unige.ch : New certificate
Another file server has already been installed, but this one is for the DAMPE experiment (no host certificate needed)
We have new hardware to be installed at the cluster: File Servers and a couple of PCs for services
We will install Puppet for DPM, and probably also for cluster configuration and setup: the plan is to use a testbed made of atlasfs29 plus one of the two service PCs mentioned above
Network - Outlook
We intend to get a new 10 Gb/s network switch, but this is still under negotiation
Most likely, it will be at the beginning of next year