
<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
   #uncomment this if you want the page only be viewable by the internal people
   #* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2015-07-02
   * *Date and time*: First Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
%TOC%

---++ Site status
---+++ CSCS
   * Operations: 
      * dCache overall status
      * CMS !PhEDEx reinstallation status
      * Tickets: 
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=405][405 | CMS | T2_CH_CSCS Phedex agents down]]:
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=402][402 | CMS | T2_CH_CSCS with CE critical for &gt; 13 hours]]:
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=398][398 | CMS | space monitoring at T2_CH_CSCS]]: lower priority
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=397][397 | CMS | T2_CH_CSCS - links]]:
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=403][403 | LHCb | CPU efficiency at CSCS-LCG2]]: Difficult to identify what's going on as the output from the job cannot be obtained.
         * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=388][388 | none | Missing Accounting Date for April 2015]]: Linked to internal WebRT #19446 and #19946
      * VO specific tickets:

---+++ PSI
   * Fabio will be on leave until 6th July

---+++ UNIBE-LHEP
   * *Operations*
      * Still bumpy and at about half capacity
      * Restored 320 old cores (from Ubelix), but many tend to crash
      * One more aircon issue (22nd June). Many nodes lost power. Working on a temperature monitoring system for the room (some input from PSI too, thanks!)
      * Likely related to the aircon problem: Lustre disks on 3 nodes went flaky.
      * More issues with nodes crashing on both clusters. Most of the time jobs remain in state "dr" in gridengine. These somehow prevent new jobs from being submitted (the new jobs remain in PREPARING state in ARC). Now added a cron job to clean these up and log the node names.
      * LAN down on ce01 on Friday 26th June. Very likely a hardware failure, but in the rush to bring the cluster back online we did not establish whether that was really the case. Recovery: swap to an unused network interface, register the network and interface changes in ROCKS, redeploy Lustre from scratch, power up and re-install stuck nodes.
   * *ATLAS specific operations*
      * gridengine multicore scheduling improved. Changes to gridengine were already in place a month ago, but success seemed limited. In addition, removed on one cluster a hack that scaled up the requested walltime by 1.4/1.5. The increased difficulty in scheduling multicore jobs is possibly explained by some ATLAS tasks with quite high walltime.
      * All ARC failures still masked by crons. A bugfix release is in apel-testing; will try to upgrade soon.
   * *Ongoing work*
      * ROCKS 6.2 just came out; now prototyping the cluster deployment chain with this version (CE, WN, Lustre MDS, Lustre OSS)
      * 6 IBM servers from CSCS collected and rack-mounted. Will be deployed upon re-installation of ce01 (ROCKS 6.2)
      * Temperature monitoring in the server room is in progress. The new water-cooled rack from Theoretical Physics monitors the inlet water temperature. Ambient sensors will be added in some racks. Monitor first to learn trends, then try to automate in the future (e.g. drain clusters when the inlet water temperature exceeds a threshold)
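
The "drain clusters when the inlet water temperature exceeds a threshold" idea could be prototyped along the following lines. This is only a sketch under stated assumptions: the names =DrainPolicy=, =next_state= and =replay=, and the 25 C / 22 C thresholds, are hypothetical and not taken from the UNIBE-LHEP setup. It uses two thresholds (hysteresis) so that a reading hovering around a single limit does not flip the cluster between draining and running. How readings are collected and how a drain is actually triggered is deliberately left out.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DrainPolicy:
    """Hypothetical thresholds in degrees Celsius (illustrative values only)."""
    drain_above_c: float   # start draining at or above this inlet temperature
    resume_below_c: float  # stop draining at or below this inlet temperature


def next_state(draining: bool, inlet_temp_c: float, policy: DrainPolicy) -> bool:
    """Return True if the cluster should be draining after this reading."""
    if inlet_temp_c >= policy.drain_above_c:
        return True
    if inlet_temp_c <= policy.resume_below_c:
        return False
    # Between the two thresholds: keep the previous state (hysteresis).
    return draining


def replay(readings: Iterable[float], policy: DrainPolicy,
           draining: bool = False) -> List[bool]:
    """Apply the policy to a series of readings, e.g. logged trend data."""
    states = []
    for temp in readings:
        draining = next_state(draining, temp, policy)
        states.append(draining)
    return states


if __name__ == "__main__":
    policy = DrainPolicy(drain_above_c=25.0, resume_below_c=22.0)
    print(replay([20.0, 24.0, 26.0, 24.0, 21.0], policy))
```

Replaying logged readings through =replay= during the "monitor first" phase would show how often a given pair of thresholds would have drained the cluster, before any automation is switched on.
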
---+++ UNIBE-ID
   * Michael/Nico cannot attend due to delivery of ESS ;-)
   * *Operations*: 
      * smooth, high usage currently
   * *ATLAS-related*: 
      * mcore jobs are now scheduled better; changes made:
         * resource reservation only set for mcore jobs (within submit-sge-jobs when priority is set)
         * increased max_reservation in scheduler conf from 7 to 32
         * default_duration in scheduler conf now increased from 24h to 97h == h_rt limit of queue where ATLAS jobs are running
      * ATM: WARNING in gridka nagios regarding latest EGI-trustanchors release 
         * IGTF-1.65, 0 days old, all present. - SHA Fingerprint failed for ca-policy-lcg. - SHA Fingerprint failed for ca-policy-egi-core
         * Is this a broken release?
   * *UBELIX Puppet Resources* 
      * As mentioned at the hpc-forum presentation, we now have a public platform for OSS stuff: 
         * http://idos-code.unibe.ch - Stash with most of our puppet modules
         * http://idos-issues.unibe.ch - Jira, our issue tracker for the code above
         * Clone as you like. :-) Contributions (aka pull requests) are welcome as soon as we have our Crowd instance ready (end of week) - don't register yet though it's possible.

---+++ UNIGE
   * Still unmanned, likely until 1st October 2015

---+++ NGI_CH
   * Certificates: http://www.lhep.unibe.ch/sits/certificates.html
   * UNIBE-LHEP bad performance April, May 2015: bad SE for ops (May: https://documents.egi.eu/public/ShowDocument?docid=2519)
   * NGI_CH - May 2015 - RP/RC OLA performance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=114449
   * ARC staged rollout?
---++ Other topics
   * Topic1
   * Topic2
Next meeting date:

---++ A.O.B.

   * *Reminder:* Face To Face meeting to be held on 21 August 2015 at CSCS.

---++ Attendants
   * CSCS: Miguel Gila
   * CMS:
   * ATLAS: Gianfranco Sciacca
   * LHCb:
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1
Topic revision: r5 - 2015-07-01 - GianfrancoSciacca