<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
   #uncomment this if you want the page only be viewable by the internal people
   #* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2016-04-07 at 15:30
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
%TOC%

---++ Site status
---+++ CSCS
   * New ARC CE instance (arc03) installed along with a new SLURM instance (15.08.8) and all the recently purchased WNs<br />(this cluster is integrated into CSCS LDAP and central SLURM DB)
   * certificates mess last week (Gianni's fault!): thanks to Gianfranco and Sigve for their help
   * some time spent fixing the Information System [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118922][(GGUS 118922)]]
   * tentative planned maintenance on 20160503 to replace IB/Eth bridges, move some VMs and reinstall arc02
   * CREAM CEs to be decommissioned by the beginning of June
   * Accounting numbers (from scheduler) from last month
*GPFS*
   * No issues to report
   * Metadata migration from local SSD to FC Flash is planned for May 3rd
*dCache*
   * Almost ready to deploy the first 500TB of new storage (from NETAPP 5560)
   * The additional 500TB will be ready by the first part of May (from SFA12K)
   * Investigating some "unexpected" file deletions (CMS)

---+++ PSI
   * Put the new CentOS7/ZFS/NFSv4 /homes hierarchy into production
      * [[http://zfsonlinux.org/][ZFS On Linux]]
      * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/space.report][/homes hierarchy space report]]
      * [[http://t3mon.psi.ch/ganglia/host_gmetrics.php?c=PSI%20Tier3%20services&h=t3nfs01.psi.ch][/homes hierarchy Ganglia ZFS/NFSv4 stats]]
   * Installing 9 new Dalco servers (2 disks were dead on arrival); each has:
      * Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 cores (HT on)
      * 128GB RAM
      * 4 x 900GB 10k SAS disks in mdadm RAID 1+0, set up by Kickstart
      * made a 100GB partition formatted as XFS in order to test [[https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-fscache.html][FS-Cache]] + NFSv4
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP

*Operations*
   * mostly stable operation on both systems, except for:
   * some random failures on some ce01 nodes ( *trans:* Transformation not installed in CE) 
      * leads to flipping between black and white-listing by HC
      * usually a cvmfs related problem, but cvmfs reports fine on all nodes
      * under investigation right now
   * eth0 dropped twice within 12h on the ce01 lustre mds:
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)<br />...<br />Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly

   * this leaves Lustre hanging; the node needs power-cycling to recover (Lustre comes back quickly)
   * possibly flaky hardware; getting a spare card to plug in in case of recurrence
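
To spot a recurrence of the eth0 drop, the two kernel messages quoted above can be searched for in the MDS log. The following is only a minimal sketch: the function name, the choice of symptom strings and the assumption of a plain syslog-style line prefix are illustrative, not part of the site setup.

```python
import re

# Sample lines copied from the mds-2-1 log excerpt quoted above.
SAMPLE_LOG = """\
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)
Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
"""

# Assumed syslog-style layout: "Mon DD HH:MM:SS host kernel: message"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)

# Message fragments treated as symptoms (hypothetical selection,
# taken from the two lines in the excerpt).
SYMPTOMS = ("nobody cared", "Reset adapter unexpectedly")


def find_nic_events(text):
    """Return (timestamp, host, message) for every kernel line matching a symptom."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and any(s in match.group("msg") for s in SYMPTOMS):
            events.append((match.group("stamp"), match.group("host"), match.group("msg")))
    return events


if __name__ == "__main__":
    for stamp, host, msg in find_nic_events(SAMPLE_LOG):
        print(f"{stamp} {host}: {msg}")
```

In practice the text would come from the node's own log file rather than from the embedded sample.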
*ATLAS specific operations*
   * HC online 33% (last month, single core only - not huge impact since over 80% of work is MCORE):
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

   * 63% of ATLAS/CH WT, 70% CPUtime in March:
http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all

   * Still on ice: No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6) 
      * but I have asked to re-discuss this within ADC (in my view this should be implemented at the middleware level)
   * UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably

   * *Accounting numbers (from scheduler) from last month (Mar 2016)* - *NOTE*: ce03/CLOUD not reported yet
      * WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
   * *Accounting numbers (from ATLAS dashboard) from last month* (Mar 2016) 
      * CPU h: 672148 (933386.8 with cloud)
      * WC h: 909450 (1243195.7 with cloud)
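
As a quick cross-check of the dashboard figures above, the ratio of CPU hours to wall-clock hours can be computed directly. This is a sketch only: reading the ratio as a "CPU efficiency" and the helper name `cpu_efficiency` are assumptions made for illustration, not a definition taken from the dashboard.

```python
def cpu_efficiency(cpu_hours, wallclock_hours):
    """Return CPU hours divided by wall-clock hours."""
    if wallclock_hours <= 0:
        raise ValueError("wall-clock hours must be positive")
    return cpu_hours / wallclock_hours


# (CPU h, WC h) as reported from the ATLAS dashboard for Mar 2016.
REPORTED = {
    "without cloud": (672148, 909450),
    "with cloud": (933386.8, 1243195.7),
}

if __name__ == "__main__":
    for label, (cpu_h, wc_h) in REPORTED.items():
        print(f"{label}: {cpu_efficiency(cpu_h, wc_h):.1%}")
```

With the reported numbers this gives roughly 74% without the cloud resources and roughly 75% with them.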

---+++ UNIBE-ID
   * All servers (but one) moved from RHEL to CentOS and all puppetized - finally
   * Short storage outages in March 
      * in Feb, upgraded ESS-3.0 (GPFS-4.1.0) =&gt; ESS-3.5 (GPFS-4.1.1)
      * =&gt; GPFS cluster overloaded at certain moments =&gt; stale file handles
      * Turned off certain logging/tracing facilities in GPFS
      * perfectly stable again for the last 3 weeks
   * Ordered 76 additional nodes on top of the 32 nodes ordered last December:
      * Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
      * 128GB RAM
      * =&gt; homogeneous queue with 108 nodes (2160 cores) exclusively for MPI usage
   * *Accounting numbers (from scheduler) from last month (Mar 2016):*
      * CPU h: 195476
      * WC h: 67481
---+++ UNIGE

   * Production:
      * Running smoothly in test mode for ATLAS (still pending some checks)
      * High load on the cluster from local users (need to look at the batch system more closely, since there is a higher chance of nodes going down)
      * Host certificates recently replaced for DPM Head and Disk nodes + ARC-CE (running late because the e-mails were sent to Szymon)
   * Storage:
      * ATLASLOCALGROUPDISK space token was almost full; after some cleaning of old datasets it is now ~75% full (~106 TB free)
      * Only one user from UniGe has a useful dataset at CSCS; moving the datasets to UniGe, then merge ATLASLOCALGROUPDISK with ATLASSCRATCHDISK
      * Providing ATLAS storage dumps every month
   * Outlook:
      * 3 User Interfaces with SLC5 will be decommissioned, which may be a good chance to start moving to CentOS
   * Accounting numbers (from scheduler) from last month (files attached for [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20160407/g07.201602.log][Feb 2016]] and [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20160407/g07.2016.log][Jan-Feb 2016]])

---+++ NGI_CH

   * <span style="color: green;">Nothing of relevance</span>
   * NGI-CH Open Tickets review
      * CSCS-LCG2
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120551">120551</a>: CSCS-LCG2_MCORE : 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120505">120505</a>: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a>: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=119171">119171</a>: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
      * UNIBE-LHEP 
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120257">120257</a>: glidein validation errors for Microboone... (UBOONE) - Following up on OSG, this should be closed
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a>: ATLAS request- storage consistency check... (ATLAS) - On hold
      * NGI_CH 
         * <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120184">120184</a>: NGI_CH - February 2016 - RP/RC OLA performance - Slow response to 2 tickets (average March response 8.51): 
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120045">https://ggus.eu/?mode=ticket_info&ticket_id=120045</a> (LHCb on arcbrisi)
            * <a target="_blank" href="https://ggus.eu/?mode=ticket_info&ticket_id=120293">https://ggus.eu/?mode=ticket_info&ticket_id=120293</a> (duplicate of the above, handled immediately, so: ???)
         * "please remind to set the proper status when handling the tickets"
         * replied to it now

---++ Other topics
Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Pablo, Dario, Dino, Gianni
   * CMS: Fabio
   * ATLAS: Luis
   * LHCb: Roland
   * EGI:

---++ Action items
