
Swiss Grid Operations Meeting on 2015-11-10

Site status

CSCS

Systems:

  • HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found a workaround ourselves: disable the Smart Array and enable legacy mode for the boot disk.
  • Prolonged IB Bridges warranty until spring 2016
  • Requested new certificates for argus* with correct DNS AltName
  • LHCb jobs are still not running well. We suggested that Vladimir use the right runtime environment (env/proxy and glite), but nothing has changed so far.
  • CMS is testing multicore jobs
  • Working hard to finalize the arc02 Puppet configuration.
  • We are planning to decommission cream04
  • Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
  • Getting offers for the Phoenix expansion
Storage:
  • Scratch - GPFS: Netapp storage firmware upgrade (no service interruption).
  • dCache:
    • We still have the cleaner problem, mainly with CMS. At the moment the cleaner has to be run manually, but the situation has stabilized after some big deletions by CMS.
    • This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

PSI

  • NFSv4
    • Context : MeetingSwissGridOperations20151015#PSI
    • In the end I built a RAID10 with 24 disks and no spare
    • Instead of a single ZFS filesystem I made a hierarchy of filesystems, as advised by Oracle
    • Properties set on the root of the hierarchy are propagated to each descendant
    • Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each descendant, atomically and at the same time.
    • Taking snapshots (but without the destroy permission) can be delegated to each user on his/her own filesystem and can also be done with simple mkdir commands over NFSv4 (Oracle Ref); this needs a tweak on ZFS on Linux, whose documentation says:
      "The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality."
    • further tasks ongoing..
  • dCache : For CSCS's information, at PSI I've tuned the dCache Xrootd threshold to xrootd.limits.threads=160 ; the default of 1000 was too high for us, as we were repeatedly getting 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
  • Security : Processed the EGI SVG Advisory - 'Critical' risk. Remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.
  • General Interest : 1TB OwnCloud/EOS @ CERN : http://cernbox.web.cern.ch/
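
The ZFS layout described in the NFSv4 item above (one filesystem per user under a common root, properties set once on the root, snapshot creation delegated without destroy, one recursive snapshot for everything) can be sketched roughly as follows. This is only an illustration: the pool name `tank/home`, the user names, the `compression` property and the helper `zfs_hierarchy_commands` are all assumptions and not the actual PSI setup. The script only builds the `zfs` command lines and does not run them. Depending on the ZFS release, delegating `snapshot` may require further permissions, so check the `zfs allow` documentation for your platform.

```python
from typing import Dict, List, Sequence


def zfs_hierarchy_commands(
    root: str,
    users: Sequence[str],
    snapshot_name: str,
    root_properties: Dict[str, str],
    delegated: Sequence[str] = ("snapshot",),
) -> List[List[str]]:
    """Build (but do not run) the zfs commands for a per-user hierarchy.

    All names passed in are hypothetical examples, not the PSI configuration.
    """
    cmds: List[List[str]] = [["zfs", "create", root]]

    # Properties set on the root are inherited by every descendant.
    for prop, value in root_properties.items():
        cmds.append(["zfs", "set", f"{prop}={value}", root])

    for user in users:
        child = f"{root}/{user}"
        cmds.append(["zfs", "create", child])
        # Delegate snapshot creation only; "destroy" is deliberately absent.
        cmds.append(["zfs", "allow", "-u", user, ",".join(delegated), child])

    # One recursive snapshot covers the root and all of its descendants.
    cmds.append(["zfs", "snapshot", "-r", f"{root}@{snapshot_name}"])
    return cmds


if __name__ == "__main__":
    for cmd in zfs_hierarchy_commands(
        "tank/home", ["alice", "bob"], "2015-11-10", {"compression": "on"}
    ):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running anything as root on the file server.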

UNIBE-LHEP

  • Operations
    • Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report
    • Re-deployment of the ce01 cluster under way:
      • SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to have a functional resource bdii on the ARC CE)
      • about 900 worker-cores installed
      • new lustre (version 2.5.3, 200 disks), Thumpers decommissioned
      • moved to slurm, cutting my teeth on it.
      • hope to go online in the next few hours
    • Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
  • ATLAS specific operations
    • Implementing the requested monthly dumps of the namespace on the DPM SE.

UNIBE-ID

  • Commissioning
    • Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; delivery in 12/2015
    • Another 32 nodes will be ordered early in 2016
  • Operations
    • Prolonged maintenance downtime due to a painful migration to the new GPFS storage
      • Lesson learned (us + IBM techie!): using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
      • No data loss, though
    • Since then smooth operation again
    • Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow during the already scheduled maintenance downtime
  • ATLAS specific operations
    • no problems
    • ordered a new SSL certificate for nordugrid.unibe.ch because of the switch to STRICT_RFC2818 by Globus GSI clients
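
As a rough illustration of the certificate name issue mentioned above (and of the "correct DNS AltName" request in the CSCS report), the sketch below checks whether a host name appears among the DNS entries of a certificate's subjectAltName. It assumes the certificate is given in the dictionary form returned by Python's `ssl.SSLSocket.getpeercert()`. The helper name `hostname_in_san` and the sample data are hypothetical. It does only an exact, case-insensitive comparison, so wildcard entries and the Common Name fallback of RFC 2818 are not handled.

```python
from typing import Any, Dict


def hostname_in_san(cert: Dict[str, Any], hostname: str) -> bool:
    """Return True if hostname is listed as a DNS subjectAltName entry.

    cert is expected in the dict form produced by ssl.SSLSocket.getpeercert().
    Exact match only: wildcards and the Common Name are ignored here.
    """
    dns_names = [
        value.lower()
        for kind, value in cert.get("subjectAltName", ())
        if kind == "DNS"
    ]
    return hostname.lower() in dns_names


if __name__ == "__main__":
    # Hypothetical certificate data, for illustration only.
    sample_cert = {
        "subject": ((("commonName", "nordugrid.unibe.ch"),),),
        "subjectAltName": (("DNS", "nordugrid.unibe.ch"),),
    }
    print(hostname_in_san(sample_cert, "nordugrid.unibe.ch"))
    print(hostname_in_san(sample_cert, "ce01.example.org"))
```

A host reached under an alias that is missing from the certificate would fail this check, which is the kind of mismatch a new certificate with the right names is meant to avoid.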

UNIGE

  • Operations
    • atlasfs18.unige.ch : ATLAS File Server, users reported problems with data transfers
      • According to first checks from monitoring (Ganglia and Nagios) the machine was up and running
      • Remote access was not possible, however
      • After a manual restart we could not get it back; a RAID controller problem is suspected
      • Fortunately, this machine is still under IBM warranty (IBM will be contacted for the repair)
      • A spare File Server is being used instead (temporarily); the disks were moved to the temporary machine
      • No further problems observed since then for atlasfs18.unige.ch
    • I will request a host certificate for a new ATLAS File Server to be added to the cluster
    • Another File Server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
    • We have new hardware to install in the cluster: File Servers and a couple of PCs for services
  • Network - Outlook
    • We intend to get a new 10 Gb/s network switch, but this is still under negotiation
    • Most likely, it will arrive at the beginning of next year
  • Storage
    • There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it
    • Checking the data stored on the DPM SE for cleanup purposes, since ATLAS has moved from its previous data management tool, "dq2", to "rucio"
    • Checking data in order to identify files which are registered in the catalogue (rucio) but not physically present on the DPM SE, and vice versa
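
The catalogue versus storage check in the last item comes down to two set differences. A minimal sketch follows, assuming that both the rucio catalogue dump and the DPM namespace dump have already been reduced to plain lists of file paths. The function name, the labels `missing_on_storage` and `not_in_catalogue`, and the example paths are illustrative only and are not taken from rucio or DPM.

```python
from typing import Dict, Iterable, List


def compare_catalogue_and_storage(
    catalogue_files: Iterable[str],
    storage_files: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare two file listings and report the differences in both directions."""
    catalogue = set(catalogue_files)
    storage = set(storage_files)
    return {
        # Registered in the catalogue but not physically on the storage element.
        "missing_on_storage": sorted(catalogue - storage),
        # Physically on the storage element but unknown to the catalogue.
        "not_in_catalogue": sorted(storage - catalogue),
    }


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    in_catalogue = ["/atlas/data/a.root", "/atlas/data/b.root"]
    on_storage = ["/atlas/data/b.root", "/atlas/data/c.root"]
    report = compare_catalogue_and_storage(in_catalogue, on_storage)
    for label, paths in report.items():
        print(label, paths)
```

Using sets keeps the comparison close to linear in the number of files, which matters when the namespace dump has millions of entries.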

NGI_CH

Other topics

  • Daniel being replaced as CMS contact person
Next meeting date:

A.O.B.

Attendants

  • CSCS: Pablo, Dario, Dino, Gianni
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco, Luis March
  • LHCb: Roland Bernet
  • EGI: Gianfranco

Action items

Topic revision: r19 - 2015-11-11 - FabioMartinelli