<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2015-11-10

   * *Time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

%TOC%

---++ Site status

---+++ CSCS

*Systems:*
   * HP Smart Array issues (configuration loss and failure to boot); lost a lot of time with HP support. Found our own workaround: disable the Smart Array and enable legacy mode for the boot disk.
   * Extended the IB bridges warranty until spring 2016
   * Requested new certificates for argus* with the correct DNS AltName
   * LHCb jobs are still not running well. We suggested that Vladimir use the right runtime environments (env/proxy and glite), but nothing has changed so far.
   * CMS is testing multicore jobs
   * Working hard to finalize the arc02 Puppet configuration.
   * We are planning to decommission cream04
   * Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
   * Getting offers for the Phoenix expansion

*Storage:*
   * Scratch - GPFS: NetApp storage firmware upgrade (no service interruption).
   * dCache:
      * We still have the cleaner problem, mainly with CMS. At the moment the cleaner needs to be executed manually, but the situation has stabilized after some big deletions by CMS.
      * This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

---+++ PSI

   * *NFSv4*
      * Context : MeetingSwissGridOperations20151015#PSI
      * In the end I built a RAID10 with 24 disks, no spare
      * Instead of a single ZFS filesystem I made a hierarchy of filesystems, as advised by [[https://docs.oracle.com/cd/E23823_01/html/819-5461/gaypa.html][Oracle]]
         * Properties set on the root of the hierarchy are propagated to each descendant
         * Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each descendant, *atomically at the same time*.
         * Taking snapshots (but without granting the destroy permission) can be delegated to each user on his/her own filesystem and also managed by simple NFSv4 =mkdir= commands ! [[http://docs.oracle.com/cd/E19253-01/819-5461/gebxb/index.html][Oracle Ref]] ; it needs a [[https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.6.5][tweak]] on ZFS on Linux
<pre>The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. </pre>
      * Further tasks are ongoing.
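The ZFS layout described in the NFSv4 item can be sketched as the sequence of =zfs= commands below. This is only an illustration, not the exact commands run at PSI: the pool path =tank/home=, the user names, the snapshot label and the =compression=on= property are all assumptions made up for the example, and the helper only builds the argument lists so that they can be reviewed before anything is executed.

```python
import subprocess


def zfs_commands(root, users, label):
    """Build zfs argv lists for a per-user filesystem hierarchy.

    root, users and label are hypothetical example values; the root
    filesystem is assumed to exist already.
    """
    cmds = [
        # A property set on the root is inherited by every descendant.
        ["zfs", "set", "compression=on", root],
    ]
    for user in users:
        fs = f"{root}/{user}"
        cmds.append(["zfs", "create", fs])
        # Delegate snapshot creation on the user's own filesystem.
        # The "destroy" permission is deliberately not granted.
        cmds.append(["zfs", "allow", "-u", user, "snapshot,mount", fs])
    # One recursive snapshot covers the root and all descendants at once.
    cmds.append(["zfs", "snapshot", "-r", f"{root}@{label}"])
    return cmds


def run(cmds, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling =run(zfs_commands("tank/home", ["alice", "bob"], "daily"))= only prints the commands. On ZFS on Linux 0.6.5 the =mkdir= route into =.zfs/snapshot= additionally needs the =zfs_admin_snapshot= module option mentioned in the release note quoted above.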
   * *dCache* : Note for CSCS: at PSI I have tuned the dCache Xrootd thread limit to xrootd.limits.threads=160. The default of 1000 was too high for us: we were repeatedly getting 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
   * *Security* : Processed the [[https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183][EGI SVG Advisory - 'Critical' risk. Remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.]]
   * *General Interest* : 1TB [[https://owncloud.org/][OwnCloud]]/[[http://information-technology.web.cern.ch/services/eos-service][EOS]] @ CERN : http://cernbox.web.cern.ch/

---+++ UNIBE-LHEP

   * *Operations*
      * Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report
      * Re-deployment of the ce01 cluster under way:
         * SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to have a functional resource BDII on the ARC CE)
         * about 900 worker cores installed
         * new Lustre (version 2.5.3, 200 disks), Thumpers decommissioned
         * moved to SLURM, cutting my teeth on it
         * hope to go online in the next few hours
      * Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
   * *ATLAS specific operations*
      * Implementing the requested monthly dumps of the namespace on the DPM SE.

---+++ UNIBE-ID

   * *Commissioning*
      * Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; delivery in 12/2015
      * Another 32 nodes will be ordered early in 2016
   * *Operations*
      * Prolonged maintenance downtime due to a painful migration to the new GPFS storage
         * Lesson learned (us + IBM techie!): using AFM and additionally running rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
         * no data loss, though
      * Smooth operation again since then
      * Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow within the already scheduled maintenance downtime
   * *ATLAS specific operations*
      * no problems
      * ordered a new SSL certificate for nordugrid.unibe.ch due to the STRICT_RFC2818 switch by Globus GSI clients

---+++ UNIGE

   * *Operations*
      * atlasfs18.unige.ch : ATLAS file server, users reported problems with data transfers
         * According to first checks from monitoring (Ganglia and Nagios) the machine was up and running
         * Remote access was not possible
         * After a manual restart it did not come back; a RAID controller problem is assumed
         * Fortunately, this machine is still under IBM warranty (IBM will be contacted for repair)
         * A spare file server is being used instead (temporarily); the disks were moved to the temporary machine
         * No further problems observed since then for atlasfs18.unige.ch
      * I will request a host certificate for a new ATLAS file server to be added to the cluster
      * Another file server has already been installed, but it is for the DAMPE experiment (no host certificate needed)
      * We have new hardware to be installed in the cluster: file servers and a couple of PCs for services
   * *Network - Outlook*
      * We intend to get a new 10 Gb/s network switch, but this is still under negotiation
      * Most likely it will arrive at the beginning of next year
   * *Storage*
      * There is a DPM SE workshop at CERN on December 7th-8th (probably interesting for other sites with a DPM SE). I will attend it
      * Checking the data stored at the DPM SE for cleaning purposes, since the ATLAS data management tool used to be "dq2" and is now "rucio"
      * Checking data in order to identify files that are registered in the catalogue (rucio) but not physically present at the DPM SE, and vice versa

---+++ NGI_CH

   * Profile ch.cern.sam-ROC_CRITICAL for ops: http://mon.egi.eu/myegi/sa/?view=2&graph=1&vo=104&profile=26&filters-value-Regions_or_Tiers=115&filters-value-Sites=&production=1&preproduction=1&dateorperiod=pd&period=pM&startdate=01-08-2015&enddate=30-09-2015
   * https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-CVE-2015-7183
   * Survey on "Quality process and ISO certification" (Quality Management, IT Service Management, Information Security Management): https://www.surveymonkey.com/r/isocertification

---++ Other topics

   * Daniel being replaced as CMS contact person
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio Martinelli, Daniel Meister
   * ATLAS: Gianfranco, Luis March
   * LHCb: Roland Bernet
   * EGI: Gianfranco

---++ Action items

   * Item1

Topic revision: r19 - 2015-11-11 - FabioMartinelli