Swiss Grid Operations Meeting on 2015-11-10

Time: 14:00
Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
External link: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
Phone gate: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

Site status

Systems:

HP Smart Array issues (configuration loss and no boot); lost a lot of time with HP support. Found our own workaround: disable the Smart Array and enable legacy mode for the boot disk.
Extended the IB bridges warranty until spring 2016
Requested new certificates for argus* with correct DNS AltName
LHCb jobs are still not running well; we suggested that Vladimir use the right runtime environments (env/proxy and glite), but nothing has changed so far.
CMS is testing multicore jobs
Working hard to finalize the arc02 puppet configuration.
We are planning to decommission cream04
Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
Getting offers for the Phoenix expansion

Storage:

Scratch - GPFS: Netapp storage firmware upgrade (no service interruption).
dCache:
- We still have the cleaner problem, mainly with CMS. At the moment the cleaner needs to be executed manually but the situation has been stabilized after some big deletions from CMS.
- This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade in order to be able to upgrade the production by the end of this month.

NFSv4
- Context : MeetingSwissGridOperations20151015#PSI
- In the end I built a RAID10 with 24 disks, no spare
- Instead of a single ZFS filesystem I created a hierarchy of filesystems, as advised by Oracle
- Properties set on the root of the hierarchy are inherited by each descendant
- A recursive snapshot of the root of the hierarchy snapshots each descendant atomically, at the same time
- Taking snapshots (but without the destroy permission) can be delegated to each user on his/her own filesystem, and snapshots can even be managed with simple mkdir commands over NFSv4 (Oracle Ref); on ZFS on Linux this needs a tweak:
```
The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. 
```
- Further tasks are ongoing
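
The ZFS layout described above can be sketched as a sequence of `zfs` commands. The snippet below only builds the command lines and does not run them. The pool name `tank/home`, the user names, the `compression` property and the snapshot name are hypothetical. The permission list `snapshot,mount` is an assumption based on the zfs man page and should be checked locally. Note that `destroy` is deliberately not delegated.

```python
def build_zfs_commands(root, users, snapshot_name, inherited_props):
    """Return the zfs command lines (as argv lists) for a per-user hierarchy."""
    cmds = [["zfs", "create", "-p", root]]

    # Properties set on the root are inherited by every descendant filesystem.
    for prop, value in sorted(inherited_props.items()):
        cmds.append(["zfs", "set", f"{prop}={value}", root])

    for user in users:
        child = f"{root}/{user}"
        cmds.append(["zfs", "create", child])
        # Delegate snapshot creation only; 'destroy' is not granted.
        cmds.append(["zfs", "allow", "-u", user, "snapshot,mount", child])

    # One recursive snapshot covers the root and all descendants atomically.
    cmds.append(["zfs", "snapshot", "-r", f"{root}@{snapshot_name}"])
    return cmds


if __name__ == "__main__":
    for argv in build_zfs_commands(
        "tank/home", ["alice", "bob"], "2015-11-10", {"compression": "on"}
    ):
        print(" ".join(argv))
```

Building the commands as lists rather than strings keeps them easy to review before passing them to something like `subprocess.run`.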
dCache : For CSCS's information, at PSI I tuned the dCache Xrootd thread limit to xrootd.limits.threads=160 ; the default of 1000 was too high for us, as we were repeatedly getting 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
Security : Processed the EGI SVG Advisory - 'Critical' risk. Remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.
General Interest : 1TB OwnCloud/EOS @ CERN : http://cernbox.web.cern.ch/

Operations
- Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report
- Re-deployment of the ce01 cluster under way:
  - SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to get a functional resource BDII on the ARC CE)
  - about 900 worker-cores installed
  - new lustre (version 2.5.3, 200 disks), Thumpers decommissioned
  - moved to slurm, cutting my teeth on it.
  - hope to go online in the next few hours
- Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
ATLAS specific operations
- Implementing the requested monthly dumps of the namespace on the DPM SE.

Commissioning
- Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; delivery in 12/2015
- Another 32 nodes will be ordered early in 2016
Operations
- Prolonged maintenance downtime due to a painful migration to the new GPFS storage
  - Lesson learned (us + IBM techie!): using AFM and additionally running rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
  - No data loss, though
- Smooth operation again since then
- Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow within the already scheduled maintenance downtime
ATLAS specific operations
- no problems
- Ordered a new SSL certificate for nordugrid.unibe.ch due to the switch to STRICT_RFC2818 by Globus GSI clients
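
For context on the certificate item above: RFC 2818 style checking compares the host name the client connects to against the names in the certificate, with dNSName entries taking precedence over the Common Name. The snippet below is a minimal sketch of that comparison and not the Globus implementation. The function name and the host name `ce01.example.org` are hypothetical, and wildcard handling is simplified to a `*` that stands for the whole left-most label.

```python
def matches_rfc2818(hostname, dns_alt_names, common_name=None):
    """Return True if hostname matches the certificate names, RFC 2818 style."""
    # If dNSName entries are present they are used instead of the Common Name.
    if dns_alt_names:
        candidates = list(dns_alt_names)
    else:
        candidates = [common_name] if common_name else []

    host_labels = hostname.lower().rstrip(".").split(".")
    for name in candidates:
        name_labels = name.lower().rstrip(".").split(".")
        if len(name_labels) != len(host_labels):
            continue
        # Simplification: '*' is honoured only as the entire left-most label.
        first_ok = name_labels[0] in ("*", host_labels[0])
        if first_ok and name_labels[1:] == host_labels[1:]:
            return True
    return False


if __name__ == "__main__":
    print(matches_rfc2818("nordugrid.unibe.ch", ["nordugrid.unibe.ch"]))
    print(matches_rfc2818("nordugrid.unibe.ch", ["ce01.example.org"],
                          common_name="nordugrid.unibe.ch"))
```

In the second call the Common Name matches but is ignored because a dNSName entry exists, which is the kind of mismatch a corrected subjectAltName avoids.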

Operations
- atlasfs18.unige.ch : ATLAS File Server, users reported problems with data transfers
  - According to first checks in the monitoring (Ganglia and Nagios), the machine was up and running
  - Remote access was not possible
  - After a manual restart it did not come back; a RAID controller problem is suspected
  - Fortunately, this machine is still under IBM warranty (IBM will be contacted for the repair)
  - A spare File Server is being used temporarily instead; the disks were moved to it
  - No further problems observed since then for atlasfs18.unige.ch
- I will request a host certificate for a new ATLAS File Server to be added to the cluster
- Another File Server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
- We have new hardware to install in the cluster: File Servers and a couple of PCs for services
Network - Outlook
- We intend to get a new 10 Gb/s network switch, but this is still under negotiation
- Most likely it will arrive at the beginning of next year
Storage
- There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it
- Checking the data stored on the DPM SE for cleanup purposes, since ATLAS has moved from its previous data management tool, "dq2", to "rucio"
- Checking the data to identify files that are registered in the catalogue (rucio) but not physically present on the DPM SE, and vice versa
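
The catalogue versus storage check above amounts to two set differences. The sketch below assumes that both the rucio dump and the DPM namespace dump have already been reduced to plain lists of file paths with the same prefix. The function name and the example paths are hypothetical.

```python
def compare_catalogue_and_storage(catalogue_paths, storage_paths):
    """Compare the files known to the catalogue with those found on the SE."""
    catalogue = set(catalogue_paths)
    storage = set(storage_paths)
    return {
        # Registered in the catalogue but not physically on the SE.
        "missing_on_storage": sorted(catalogue - storage),
        # Physically on the SE but unknown to the catalogue.
        "not_in_catalogue": sorted(storage - catalogue),
    }


if __name__ == "__main__":
    result = compare_catalogue_and_storage(
        ["/atlas/data/a.root", "/atlas/data/b.root"],
        ["/atlas/data/b.root", "/atlas/data/c.root"],
    )
    for kind, paths in result.items():
        print(kind, paths)
```

Files in the second category are candidates for cleanup, while files in the first would need to be reported as lost or re-transferred.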

Topic revision: r19 - 2015-11-11 - FabioMartinelli