<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-02-04 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS

*Storage*
   * Hardware / physical installation:
      * 8 Feb: new dCache servers (4x)
      * 8 Feb: MPO cabling to connect Phoenix to the CSCS SAN
      * 9 Feb: NetApp E5660 (~0.5 PB)
   * dCache:
      * The "cleaner problem" (mainly affecting CMS) is no longer present; space is freed automatically, as expected
      * ATLAS dumps are in place; something still to adjust for 'atlasgroupdisk/perf-egamma' and 'atlasscratchdisk' ( https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=428 )
   * GPFS:
      * Unplanned maintenance was needed on Wed 3 Feb to recreate the filesystem because of a metadata inconsistency problem

*Systems*
   * Preparing and consolidating racks for the new arrivals at the end of this month
   * Checking the published HEP-SPEC06 values (see the ldapsearch sketch after the NGI_CH section)
   * Tuned the SLURM configuration to improve cluster performance
   * Fixed two HP nodes: one with InfiniBand failures, the other with a faulty 1G management network card
   * Testing a complete Puppet installation for the worker nodes; it is working fine, only some CVMFS parameters and the CREAM wrapper script remain to be checked

*Accounting*
   * Accounting numbers (from scheduler) from last month (a =sreport= sketch follows the NGI_CH section):
      * http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201601.txt

---+++ PSI

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ UNIBE-LHEP

*Operations*
   * Nothing significant to report; stable operation on both systems
   * 256 new cores were delivered yesterday; we hope to deploy them before the weekend

*ATLAS-specific operations*
   * No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
   * ANALY_UNIBE-LHEP is blacklisted in HammerCloud: no time to debug yet, but the impact is low since there are currently not many ANALY jobs
   * A couple of stable weeks of operation for UNIBE-LHEP_CLOUD_MCORE, then we lost the cluster and could not fix it yet

*Accounting*
   * Accounting numbers (from scheduler) from last month (Jan 2016)
      * CPU h: 792492 (ATLAS) - 12671 (t2k.org) - 1879 (uboone) - 25 (ops)
   * Accounting numbers (from ATLAS dashboard) from last month (Jan 2016)
      * CPU h: 662466 (774848 with cloud)
      * WC h: 679368 (796292 with cloud)

---+++ UNIBE-ID

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ UNIGE

*Operations*
   * Running smoothly; higher user activity since the last meeting
   * Grid (ATLAS) jobs: UNIGE-DPNC is in "Test" status and ~1/3 of jobs failed, apparently because they ran out of memory; needs checking
   * A scheduled downtime is planned at some point, needed for system and security upgrades (also related to getting involved in ATLAS production)

*Storage*
   * Dump of the DPM SE for ATLAS finally submitted (this dump should be provided once a month)
   * In addition to these ATLAS checks, we should clean our DPM: old user data and other projects (to be done)

*Outlook*
   * Request for a network switch upgrade to 10 Gb/s plus the acquisition of 3 GPUs has been submitted (resolution expected around March 2016)
   * GPU info (NVIDIA): http://www.microspot.ch/msp/fr/pc-komponenten/grafikkarten/gainward-geforce-gtx-980-grafikkarten-gf-gtx-9-0000948922
   * A more detailed description of the requested GPU system:
      * TYAN B7079F77CV10HR-N 2X10C - 256GB - 4XGTX980 - 64GB
      * 4U, FT77C, C612
      * (10) 2.5" hot-swap bays
      * (8) PCI-E G3 x16, for NV GPU cards
      * 3200W (2+1) 80+ Platinum
      * 2x Intel Xeon E5-2620v3 six-core
      * 4x Samsung 16GB DDR4 DIMM, PC4-17000 (2133MHz), registered, ECC, low voltage (1.2V)
      * 1x Samsung SSD 850 PRO 256GB
      * 8x Gainward GTX980, 4GB GDDR5, PCI-E 3.0 x16
   * Install Puppet for the DPM SE (and probably also for cluster configuration and setup, replacing YAIM)

*Accounting*
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH

   * Nothing to report
   * NGI_CH open tickets review: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
   * CSCS-LCG2
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117786][117786]] (ATLAS: storage dumps) almost done - two paths still to be fixed
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=119021][119021]] (LHCb team: jobs failed) no information provided - changed to "waiting for reply"
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=119171][119171]] (CMS: workflow failures) in progress
   * UNIBE-LHEP
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (ATLAS: storage dumps) on hold
   * NGI_CH
      * [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=118922][118922]] (affects CSCS-LCG2 and UNIBE-LHEP): GlueSubClusterPhysicalCPUs and GlueSubClusterLogicalCPUs in the BDII - explicit notification to CSCS-LCG2 added; a query sketch follows below
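Related to ticket 118922 and the HEP-SPEC06 check in the CSCS section: the values a site publishes can be inspected directly on its BDII with =ldapsearch=. A minimal sketch, assuming the GLUE 1.3 schema on the standard BDII port 2170; the host name =bdii.example.ch= is a placeholder, and the benchmark attribute actually published may differ per site:

<verbatim>
# Inspect the CPU counts (and the SpecInt2000 benchmark) a site publishes
# in the BDII, GLUE 1.3 schema. Host and site names below are placeholders:
# substitute the real site BDII host and the site's Mds-Vo-name.
SITE_BDII=bdii.example.ch
SITE_NAME=CSCS-LCG2
ldapsearch -x -LLL -h "$SITE_BDII" -p 2170 \
    -b "Mds-Vo-name=$SITE_NAME,o=grid" \
    '(objectClass=GlueSubCluster)' \
    GlueSubClusterPhysicalCPUs GlueSubClusterLogicalCPUs GlueHostBenchmarkSI00
</verbatim>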
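Several of the "accounting numbers (from scheduler)" items above refer to monthly totals from SLURM. A minimal sketch of how such numbers can be pulled from the SLURM accounting database, assuming =slurmdbd= accounting is enabled; the cluster name =phoenix= matches the CSCS report linked above and should be adapted for other sites:

<verbatim>
# Per-account CPU usage in hours covering January 2016, as recorded by
# slurmdbd. Adjust the cluster name and the start/end dates as needed
# (the end date is exclusive).
sreport cluster AccountUtilizationByUser cluster=phoenix \
    start=2016-01-01 end=2016-02-01 -t Hours
</verbatim>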
---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS:
   * CMS:
   * ATLAS: Luis March
   * LHCb:
   * EGI: Luis March

---++ Action items

   * Item1