<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable by internal people only
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2019-03-07 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS

Systems
   * Phoenix: all nodes idle, load moved to Daint
   * ARC CEs reinstalled on new hardware; only arc04 remains, with its reinstallation scheduled for Monday
   * Testing the new squid; reinstallation of cvmfs and cvmfs1 is scheduled
   * Distribution of the old compute nodes:
      * LHCb wants 2x chassis with 4x nodes each
      * ATLAS will take around 100x nodes

Storage

dCache
   * normal operation
   * preparing the lab for the new design + upgrade
   * the complete storage migration will start in April

GPFS
   * normal operation
   * upgraded firmware on the DELL SC9000 (SSD tier 1)
   * next week: rolling/online network interface replacement (from InfiniBand to Ethernet)
   * upgrade to GPFS 5.0.2
   * slow-tier migration in April from DDN SFA12k to DELL SC9000

---+++ PSI

   * Xxx
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP

   * Ramping down LHEP in view of the cluster re-deployment
   * Monthly summary: pledged 18k, delivered 18k
      * Ubelix contributing >50% (23% typical)
      * Running an average of >1850 slots (2500 typical)

<img alt="" height="257" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&granularity=8 Hours&generic=0&sortBy=16&series=All&type=ewa" width="342" />

   * *6-month history UniBE (pledge: 18 kHS06)*

<img alt="" height="333" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/resourceutilization_individual?sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2018-09-01&end=2019-02-28&timeRange=daily&granularity=Monthly&generic=0&sortBy=16&diag1=0&diag2=0&diag3=0&diag4=0&diag5=0&diag6=0&diag7=0&diag8=0&diagT=0&diag8pl=0&series=All&type=wchs" width="444" />

   * *Accounting numbers (from scheduler) from last month, LHEP only*
      * omitted this month

---+++ Swiss ATLAS statistics

   * *HammerCloud availability:*

<img alt="" height="231" src="%ATTACHURL%/HammerCloud.png" width="643" />

      * ANALY_CSCS-HPC: 95%
      * CSCS-LCG2-HPC_MCORE: 94.5%
      * UNIBE-*: 100%
   * Running slots
      * A large number of stuck jobs on ARC skews the statistics for CSCS, creating reporting problems to ATLAS
      * Very likely due to the reported issues with the Daint scratch file system, affecting WLCG jobs in some way
      * This required a laborious manual clean-up:
         * job list provided by ATLAS, culled from the aCT
         * manual clean-up of the ARC sessiondir carried out by Miguel

<img alt="" height="350" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/jobnumbers_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&granularity=8 Hours&generic=0&sortBy=0&series=All&type=rmulticores" width="466" />

   * *Accounting numbers from the ATLAS dashboard (February 2019), CSCS+UNIBE*

%EDITTABLE{}%
| *Cluster* | *Job Type* | *Produced WC core-hours* | *Good vs Bad WC %* | *CPU eff good jobs %* |
| CSCS | Any | 3'088'664 (72%) | 0.52 | 0.70 |
| UniBe | Any | 1'149'338 (28%) | 0.75 | 0.75 |

<img alt="" height="305" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/resourceutilization_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&granularity=8 Hours&generic=0&sortBy=0&diag1=0&diag2=0&diag3=0&diag4=0&diag5=0&diag6=0&diag7=0&diag8=0&diagT=0&diag8pl=0&series=All&type=wchs" width="407" />
<img alt="" height="272" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&sortBy=0&granularity=8 Hours&generic=0&series=All&type=qbwc" width="388" />
<img alt="" height="240" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&granularity=8 Hours&generic=0&sortBy=0&series=All&type=ewa" width="319" />
<img alt="" height="236" src="http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2019-02-01&end=2019-02-28&timeRange=daily&granularity=8 Hours&generic=0&sortBy=0&series=All&type=ewg" width="315" />

   * Take-home lessons from the last month:
      * Failed WC is very high; we need more real-time alerts
      * The public dashboard replica was offline for a while
      * ATLAS now relies *fully* on ARC services:
         * we need ARC metrics and/or logs
         * Dino has some ideas, but such an implementation is not a short-term affair
         * what can we do in the meanwhile?
         * we need monitoring (Nagios, Graylog?) and automated checks on the ARC services
   * Please report general Daint issues that could affect WLCG jobs (Slack general channel or email) so that we can react if needed
   * ...

---+++ UNIBE-ID

   * End of February: reinstallation of the ARC CE on nordugrid.unibe.ch, AKA UNIBE LHEP_UBELIX:
      * reason: EL6 -> EL7
      * smooth transition within ~4h
      * no issues after the reinstallation

---+++ UNIGE

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH

   * Xxx
   * NGI-CH open tickets review

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS:
   * CMS:
   * ATLAS:
   * LHCb:
   * EGI:

---++ Action items

   * Item1
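The interim "automated checks on ARC services" raised under the take-home lessons could start as something as simple as a TCP reachability probe against the CE endpoints, run from cron or wrapped as a Nagios check. This is a minimal sketch under assumptions: the host name =arc04.lcg.cscs.ch= is illustrative only, and the ports assume a stock ARC CE setup (2811 for GridFTP job submission, 2135 for the LDAP information system) — a real check would also want to inspect a-rex itself and its logs.

```python
import socket

# Hypothetical endpoint map: host name and ports are assumptions, not the
# actual CSCS configuration. Adjust to the real CE endpoints before use.
ARC_CHECKS = {
    "gridftp (job submission)": ("arc04.lcg.cscs.ch", 2811),
    "LDAP info system": ("arc04.lcg.cscs.ch", 2135),
}


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_checks(checks):
    """Map each check name to a bool: True if the endpoint is reachable."""
    return {name: port_open(host, port) for name, (host, port) in checks.items()}


# Example cron/Nagios wrapper (left commented out so importing is side-effect free):
#   results = run_checks(ARC_CHECKS)
#   exit(0 if all(results.values()) else 2)   # 2 = Nagios CRITICAL
```

A plugin like this only tells us that the daemons accept connections, not that jobs flow; it is meant as a stop-gap until proper ARC metrics/log shipping (the Graylog idea) is in place.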
---++ Topic attachments

| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| HammerCloud.png | r1 | 453.1 K | 2019-03-07 - 09:18 | GianfrancoSciacca | ATLAS HammerCloud last month |
Topic revision: r5 - 2019-03-07 - DinoConciatore