<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-12-10

   * *Date and time*: 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * *Storage*
      * dCache: stable, but we still have to run the cleaner manually. The upgrade to 2.10 will be performed on Wed 13th Jan 2016.
      * ATLAS: working on the monthly dumps.
      * GPFS (scratch): nothing to report.
      * New hardware: 4 servers for dCache and ~1 PB of storage. Working on moving the GPFS metadata disks to flash-based storage.
   * *Compute*
      * Added some check functions to the node health check:
         * swap cleaner
         * automatic resolution of some blackhole scenarios, e.g. auto-remount of file systems
         * after 60 + a random number of days, the node is put in drain for cleanup and reboot
      * Started some tests with a new Slurm version, to migrate sltop.
      * Today we will order 40 new compute nodes with E5-2680v4 CPUs.

---+++ PSI
   * Xxx

---+++ UNIBE-LHEP
   * *Operations*
      * ce01 cluster re-installation virtually completed (about 900 worker cores running, 120 still to be installed, 256 awaiting delivery)
      * Started with a simple Slurm setup (slurm-15.08.1) in order to cut down on commissioning time: one partition with
        <verbatim>
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
MemLimitEnforce=no
        </verbatim>
      * We don't over-subscribe memory any more: nodes don't starve and crash
      * Memory usage is properly accounted for in 15.08 (PSS): no jobs killed on an (artificial) over-limit of "vmem" (which is the full address space reserved by a process, not what is actually allocated or used)
      * Comparing job failure rates between ce01 and ce02 (still on the old SGE) has convinced me to rush the re-installation of ce02 (started earlier today)
   * *ATLAS specific operations*
      * Stable workflows by ATLAS (very large improvement since the beginning of Run II)
      * Stuck with the implementation of monthly dumps of the namespace on the DPM SE:
         * head node on SLC5: the dump script does not work, and generating a valid proxy is also problematic
         * decided to push the re-deployment of the head node on SLC6
            * the legacy configuration tool (YAIM) is no longer supported
            * Puppet-based configuration; got the right docs at the DPM workshop earlier this week at CERN
            * tests ongoing on a pps VM
            * also complicated by the fact that my site BDII is still co-located with the DPM head node
            * this will likely be the first task for 2016

---+++ UNIBE-ID
   * Xxx

---+++ UNIGE
   * *Operations*
      * atlasfs29.unige.ch: new certificate
      * Another file server has already been installed, but this one is for the DAMPE experiment (no host certificate needed)
      * We have new hardware to be installed in the cluster: file servers and a couple of PCs for services
      * We will use Puppet for DPM and probably for cluster configuration and setup: we will start with a testbed of atlasfs29 + 1 service PC (1 out of the 2 mentioned just above)
   * *Network - Outlook*
      * We plan to buy a new 10 Gb/s network switch, but this is still under negotiation
      * Most likely it will arrive at the beginning of next year
   * *Storage*
      * There was a DPM SE workshop at CERN on December 7th-8th: https://indico.cern.ch/event/432642/
      * Checking the data stored at the DPM SE for cleaning purposes, since ATLAS requested it
      * Checking data in order to identify files which are registered in the catalogue (rucio) but not physically present at the DPM SE, and vice versa

---+++ NGI_CH
   * Nothing to report

---++ Other topics
   * Proposal to add to this meeting: T2 monthly pledge review (CSCS, UNIBE); GGUS open ticket review
   * Coverage over the holiday season
   * Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS:
   * CMS:
   * ATLAS: Gianfranco, Luis March
   * LHCb:
   * EGI: Gianfranco

---++ Action items
   * Item1
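The catalogue-vs-storage check UNIGE describes above amounts to a set comparison between two namespace dumps: files present on disk but not registered in rucio ("dark data"), and files registered but not physically present ("lost files"). A minimal sketch, assuming each dump has already been flattened to one normalised path per line (the function and sample paths are illustrative, not the actual dump format):

```python
# Compare a catalogue dump (e.g. from rucio) against a storage-element
# namespace dump (e.g. from a DPM dump script). Inputs are iterables of
# path strings; real dumps would need path normalisation first.

def compare_dumps(catalogue_paths, storage_paths):
    catalogue = set(catalogue_paths)
    storage = set(storage_paths)
    dark = sorted(storage - catalogue)   # physically present, not registered
    lost = sorted(catalogue - storage)   # registered, not physically present
    return dark, lost

if __name__ == "__main__":
    cat = ["/atlas/data/file1", "/atlas/data/file2"]
    sto = ["/atlas/data/file2", "/atlas/data/file3"]
    dark, lost = compare_dumps(cat, sto)
    print(dark)  # ['/atlas/data/file3']
    print(lost)  # ['/atlas/data/file1']
```

In practice the two dumps are taken at different times, so files created or deleted between the dumps appear as false positives; a real check would only act on entries that show up in consecutive dumps.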
Topic revision: r6 - 2015-12-10 - LuisMarch