<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
   #uncomment this if you want the page only be viewable by the internal people
   #* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2015-11-10
   * *Time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
%TOC%

---++ Site status
---+++ CSCS

*Systems:*

   * HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found our own workaround: disable the Smart Array and enable legacy mode for the boot disk.
   * Extended the warranty of the IB bridges until spring 2016.
   * Requested new certificates for argus* with the correct DNS AltName.
   * LHCb jobs are still not running well. We suggested that Vladimir use the right runtime environment (env/proxy and glite), but nothing has changed so far.
   * CMS is testing multicore jobs.
   * Working hard to finalize the arc02 Puppet configuration.
   * We are planning to decommission cream04.
   * Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
   * Getting offers for the Phoenix expansion.

*Storage:*

   * Scratch - GPFS: NetApp storage firmware upgrade (no service interruption).
   * dCache:
      * We still have the cleaner problem, mainly with CMS. At the moment the cleaner needs to be executed manually, but the situation has stabilized after some big deletions from CMS.
      * This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

---+++ PSI
   * *NFSv4* 
      * Context : MeetingSwissGridOperations20151015#PSI
      * In the end I built a RAID10 with 24 disks, no spare
      * Instead of a single ZFS filesystem I created a hierarchy of filesystems, as advised by [[https://docs.oracle.com/cd/E23823_01/html/819-5461/gaypa.html][Oracle]]
      * Properties set on the root of the hierarchy are propagated to each descendant
      * Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each descendant, *atomically at the same time*.
      * Taking snapshots (without granting the destroy permission) can be delegated to each user on his/her own filesystem, and snapshots can also be managed with simple =mkdir= commands over NFSv4 ([[http://docs.oracle.com/cd/E19253-01/819-5461/gebxb/index.html][Oracle Ref]]). On ZFS on Linux this needs a [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5][tweak]]: <pre>The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. </pre>
      * Further tasks are ongoing.
   * *dCache* : For CSCS's information: at PSI I have set the dCache Xrootd limit =xrootd.limits.threads=160=. The default of 1000 was too high for us: we were repeatedly getting 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
   * *Security* : Processed the [[https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183][EGI SVG Advisory - 'Critical' risk. Remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.]]
   * *General Interest* : 1TB [[https://owncloud.org/][OwnCloud]]/[[http://information-technology.web.cern.ch/services/eos-service][EOS]] @ CERN : http://cernbox.web.cern.ch/
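
The ZFS layout described in the NFSv4 item above can be sketched as a small Python helper that only builds the =zfs= command lines and does not run them. The pool/filesystem name =tank/home=, the user names, the snapshot label and the =compression=lz4= property are hypothetical choices for illustration, not the values used at PSI. The =mount= permission is granted together with =snapshot= on the assumption that the local =zfs allow= documentation lists it as a companion permission; please check the man page of your ZFS version.

```python
from typing import List


def build_zfs_commands(root: str, users: List[str], snap_label: str) -> List[List[str]]:
    """Return zfs command lines for a per-user filesystem hierarchy.

    Nothing is executed here; the caller decides whether to run the commands.
    All names passed in are hypothetical examples.
    """
    commands = [
        # Parent filesystem that acts as the root of the hierarchy.
        ["zfs", "create", root],
        # A property set on the root is inherited by every descendant
        # (compression=lz4 is only an illustrative choice).
        ["zfs", "set", "compression=lz4", root],
    ]
    for user in users:
        child = f"{root}/{user}"
        # One child filesystem per user.
        commands.append(["zfs", "create", child])
        # Delegate snapshot creation only; 'destroy' is deliberately not granted.
        commands.append(["zfs", "allow", "-u", user, "snapshot,mount", child])
    # A single recursive snapshot covers the root and all of its descendants.
    commands.append(["zfs", "snapshot", "-r", f"{root}@{snap_label}"])
    return commands


if __name__ == "__main__":
    for cmd in build_zfs_commands("tank/home", ["alice", "bob"], "2015-11-10"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before running anything as root on the file server.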

---+++ UNIBE-LHEP
   * *Operations* 
      * Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report
      * Re-deployment of the ce01 cluster under way: 
         * SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to have a functional resource BDII on the ARC CE)
         * about 900 worker-cores installed
         * new Lustre (version 2.5.3, 200 disks), Thumpers decommissioned
         * moved to Slurm, cutting my teeth on it.
         * hope to go online in the next few hours
      * Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
   * *ATLAS specific operations* 
      * Implementing the requested monthly dumps of the namespace on the DPM SE.
---+++ UNIBE-ID
   * *Commissioning* 
      * Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; delivery expected in 12/2015
      * Another 32 nodes will be ordered early in 2016
   * *Operations* 
      * Prolonged maintenance downtime due to a painful migration to the new GPFS storage 
         * Lesson learned (us + IBM techie!): using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
         * No data was lost, though
      * Since then smooth operation again
      * Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow during the already scheduled maintenance downtime
   * *ATLAS specific operations* 
      * no problems
      * ordered a new SSL certificate for nordugrid.unibe.ch due to the STRICT_RFC2818 switch by Globus GSI clients

---+++ UNIGE
   * *Operations* 
      * atlasfs18.unige.ch (ATLAS file server): users reported problems with data transfers 
         * According to the first checks in monitoring (Ganglia and Nagios), the machine was up and running
         * Remote access was not possible
         * After a manual restart the machine did not come back; a RAID controller problem is suspected
         * Fortunately, this machine is still under warranty with IBM (they will be contacted for the repair)
         * A spare file server is being used instead as a temporary measure; the disks were moved to the temporary machine
         * No further problems observed since then for atlasfs18.unige.ch
      * I will request a host certificate for a new ATLAS file server to be added to the cluster
      * Another file server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
      * We have new hardware to be installed in the cluster: file servers and a couple of PCs for services
   * *Network - Outlook* 
      * We intend to get a new 10 Gb/s network switch, but this is still under negotiation
      * Most likely, it will arrive at the beginning of next year
   * *Storage* 
      * There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it
      * Checking the data stored on the DPM SE for cleanup purposes, since ATLAS has moved from its previous data management tool, "dq2", to "rucio"
      * Checking data in order to identify files that are registered in the catalogue (rucio) but not physically present on the DPM SE, and vice versa
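
The catalogue versus storage comparison mentioned in the last item can be sketched as two set differences. This is a minimal illustration, assuming both sides are available as plain lists of file paths, one per line. The function name, the dump format and the example paths are hypothetical and are not taken from the rucio or DPM tools.

```python
from typing import Iterable, Set, Tuple


def compare_namespaces(
    catalogue_paths: Iterable[str], storage_paths: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Compare a catalogue dump with a storage namespace dump.

    Returns (missing_on_storage, not_in_catalogue):
    paths registered in the catalogue but absent from the SE, and
    paths present on the SE but not registered in the catalogue.
    """
    # Normalise lines and skip empty ones so trailing newlines do not matter.
    catalogue = {line.strip() for line in catalogue_paths if line.strip()}
    storage = {line.strip() for line in storage_paths if line.strip()}
    missing_on_storage = catalogue - storage
    not_in_catalogue = storage - catalogue
    return missing_on_storage, not_in_catalogue


if __name__ == "__main__":
    # Hypothetical example data standing in for the two dumps.
    catalogue_dump = ["/dpm/example/a.root\n", "/dpm/example/b.root\n"]
    storage_dump = ["/dpm/example/b.root\n", "/dpm/example/c.root\n"]
    missing, unregistered = compare_namespaces(catalogue_dump, storage_dump)
    print("registered but not on the SE:", sorted(missing))
    print("on the SE but not registered:", sorted(unregistered))
```

Both dumps should be taken as close together in time as possible, since files written or deleted between the two dumps would otherwise show up as false differences.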

---+++ NGI_CH
   * Profile ch.cern.sam-ROC_CRITICAL for ops: http://mon.egi.eu/myegi/sa/?view=2&graph=1&vo=104&profile=26&filters-value-Regions_or_Tiers=115&filters-value-Sites=&production=1&preproduction=1&dateorperiod=pd&period=pM&startdate=01-08-2015&enddate=30-09-2015
   * https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183
   * Survey on "Quality process and ISO certification" (Quality Management, IT Service Management, Information Security Management): https://www.surveymonkey.com/r/isocertification
---++ Other topics
   * Daniel is being replaced as CMS contact person
Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco, Luis March
   * LHCb: Roland Bernet
   * EGI: Gianfranco

---++ Action items