Swiss Grid Operations Meeting on 2015-11-10
Site status
CSCS
Systems:
- HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found our own workaround: disable the Smart Array and enable legacy mode for the boot disk.
- Prolonged IB Bridges warranty until spring 2016
- Requested new certificates for argus* with correct DNS AltName
- LHCb jobs are still not running well. We suggested that Vladimir use the correct runtime environments (env/proxy and glite), but nothing has changed so far.
- CMS is testing multicore jobs
- Working hard to finalize the arc02 Puppet configuration.
- We are planning to decommission cream04
- Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
- Getting offers for the Phoenix expansion
Storage:
- Scratch - GPFS: Netapp storage firmware upgrade (no service interruption).
- dCache:
- We still have the cleaner problem, mainly with CMS. At the moment the cleaner needs to be executed manually but the situation has been stabilized after some big deletions from CMS.
- This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade the production system by the end of this month.
PSI
UNIBE-LHEP
- Operations
- Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report
- Re-deployment of the ce01 cluster under way:
- SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to have a functional resource BDII on the ARC CE)
- about 900 worker-cores installed
- new lustre (version 2.5.3, 200 disks), Thumpers decommissioned
- moved to slurm, cutting my teeth on it.
- hope to go online in the next few hours
- Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
- ATLAS specific operations
- Implementing the requested monthly dumps of the namespace on the DPM SE.
UNIBE-ID
- Commissioning
- Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; delivery expected in 12/2015
- Another 32 nodes will be ordered early in 2016
- Operations
- Prolonged maintenance downtime due to a painful migration to the new GPFS storage
- Lesson learned (us + IBM techie!): Using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is finally disabled
- though no data loss
- Since then smooth operation again
- Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow during the already scheduled maintenance downtime
- ATLAS specific operations
- no problems
- ordered new SSL certificate for nordugrid.unibe.ch due to STRICT_RFC2818 switch by Globus GSI clients
UNIGE
- Operations
- atlasfs18.unige.ch : ATLAS File Server, users reported problems with data transfers
- According to first checks from monitoring (Ganglia and Nagios) the machine was up and running
- However, remote access was not possible
- After a manual restart we were not able to get it back; a RAID controller problem is suspected
- Fortunately, this machine is still under warranty with IBM (they will be contacted for the repair)
- A spare File Server is being used instead (a temporary measure); the disks were moved to the temporary machine
- No further problems observed since then for atlasfs18.unige.ch
- I will request a host certificate for a new ATLAS File Server to be added to the cluster
- Another File Server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
- We have new hardware to be installed in the cluster: File Servers and a couple of PCs for services
- Network - Outlook
- We intend to get a new 10 Gb/s network switch, but this is still under negotiation
- Most likely, it will arrive at the beginning of next year
- Storage
- There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it
- Checking the data stored at the DPM SE for cleaning purposes, since the ATLAS data management tool used to be "dq2" and is now "rucio"
- Checking data in order to identify files which are registered in the catalogue (rucio), but not physically at the DPM SE and vice versa
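The catalogue-versus-storage check described above can be sketched as a plain set comparison. This is a minimal illustration only, not the procedure used at UNIGE: it assumes both sides are already available as lists of file paths (for example, one path per line from a catalogue dump and from a storage namespace dump), and it does not call any Rucio or DPM API. The function name and the labels "missing" and "unregistered" are chosen here for illustration.

```python
def compare_listings(catalogue_paths, storage_paths):
    """Compare two iterables of file paths.

    Returns a tuple (missing, unregistered):
      missing      - paths in the catalogue listing but not in the storage listing
      unregistered - paths in the storage listing but not in the catalogue listing
    Both input formats are assumptions: one path per entry, surrounding
    whitespace and empty entries ignored.
    """
    catalogue = {p.strip() for p in catalogue_paths if p.strip()}
    storage = {p.strip() for p in storage_paths if p.strip()}

    missing = sorted(catalogue - storage)        # registered, not on the SE
    unregistered = sorted(storage - catalogue)   # on the SE, not registered
    return missing, unregistered


if __name__ == "__main__":
    # Hypothetical example paths, for illustration only.
    catalogue_dump = ["/dpm/example/file_a\n", "/dpm/example/file_b\n"]
    storage_dump = ["/dpm/example/file_b\n", "/dpm/example/file_c\n"]
    missing, unregistered = compare_listings(catalogue_dump, storage_dump)
    print("registered but not on storage:", missing)
    print("on storage but not registered:", unregistered)
```

Because both listings are held in memory as sets, this approach suits dumps of modest size; very large dumps would more likely be sorted and compared as streams.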
NGI_CH
Other topics
- Daniel being replaced as CMS contact person
Next meeting date:
A.O.B.
Attendants
- CSCS: Pablo, Dario, Dino, Gianni
- CMS: Fabio Martinelli, Daniel Meister
- ATLAS: Gianfranco, Luis March
- LHCb: Roland Bernet
- EGI: Gianfranco
Action items
Topic revision: r19 - 2015-11-11 - FabioMartinelli