<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
   #uncomment this if you want the page only be viewable by the internal people
   #* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

---+ Swiss Grid Operations Meeting on 2013-08-08
   * *Date and time*: First Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 9227296)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=Nrq24qRR4V1u
   * *Phone gate*: From Switzerland: 0225330322 (portal) + 9227296 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

---++ Agenda

Status
   * CSCS (reports George): 
      * The root partition on Storage01 filled up on 29th July, resulting in failed transfers. A full reboot of the machine was required to get it back into service.
         * Restarting the dCache services brought the used space reported by the operating system back to the normal ~20%.
         * The next day we noticed that disk usage was still growing, caused by dCache creating trace files under /tmp.
            * Altering the log levels within the dCache CLI/pcells didn't have any effect.
            * For the time being we are log-rotating these files.
            * The relevant line in logback.xml has been changed so that these files will no longer be created once dCache is next restarted.
      * We have just provisioned slurm1, slurm2 and cream04 to begin testing Slurm/EMI3.
      * Achieved 99% availability and 100% reliability in the Tier 2 report for July: http://sam-reports.web.cern.ch/sam-reports/2013/201307/wlcg/WLCG_Tier2_OPS_Jul2013.pdf
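
For the Storage01 incident above, a minimal sketch of how a growing root partition could be traced back to recently written large files. The function names, the size threshold and the time window are illustrative assumptions; this is not the procedure CSCS actually used.

```python
import os
import shutil
import stat
import time


def usage_percent(path):
    """Return the used space of the filesystem holding `path`, in percent."""
    total, used, _free = shutil.disk_usage(path)
    return 100.0 * used / total


def recent_large_files(root, min_bytes, max_age_s, now=None):
    """Return (size, path) pairs for regular files under `root` that are at
    least `min_bytes` large and were modified within the last `max_age_s`
    seconds, largest first."""
    now = time.time() if now is None else now
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file vanished or is not readable
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets, devices
            if st.st_size >= min_bytes and now - st.st_mtime <= max_age_s:
                hits.append((st.st_size, full))
    return sorted(hits, reverse=True)
```

Calling =usage_percent("/")= from a periodic check and, above some threshold, =recent_large_files("/tmp", 100 * 1024 * 1024, 24 * 3600)= would list files such as freshly written trace files; both values are placeholders to be tuned per host.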

   * PSI (reports Fabio): 
      * Good news:
         * I've physically installed our new [[http://www.netapp.com/us/products/storage-systems/e5400/e5400-tech-specs.aspx][NetApp E5460 360TB raw]]; if you have never seen a NetApp E5460, have a look at this [[http://www.youtube.com/watch?v=n5ULb2OPFD8][youtube]] video.
         * The RAID6 arrays (creation took ~4 days), VolGroups, Vols and FC Hosts were *automatically* created by simply loading the configuration of our other E5460 (this saved me a lot of time!).
         * I'm preparing an SL6 + RDAC + FC installation to stress-test the E5460 before merging it into our production environment.
         * [[http://support.netapp.com/][NetApp support]] was unresponsive and really remote; it's based in India, so *answers are sent in their time zone*; it took days to get a non-guest account, download SANtricity and activate/map the E5460 serial number to my ID.
         * I did not have time to try it, but the [[http://www.netapp.com/us/services-support/autosupport.aspx][NetApp AutoSupport]] service, which lets NetApp monitor the system remotely, looks nice.
      * Bad news: *Our T3 has been partially down since 5th Aug*! On my return from a Sunday in Italy, a chain of failures:
         * The Milano -> Zurich train: I arrived home at 3am!
         * On 5th Aug, a Solaris dCache server X4540 =t3fs07= froze. I rebooted it, but Solaris 10 could not boot because the Flash Card had failed. In addition, 3 SATA disks in the server were broken, 2 of them in the same =raidz2=, and 1 of them was producing an endless =Disconnected command timeout for Target 1=.
         * In the meantime 2 disks were failing in another X4540 server, =t3fs11=, again in the same =raidz2=.
         * To close the chain, another disk failed in another X4540 server, =t3fs10=, but that was easy to fix.
         * To fix this I've:
            * Installed Solaris Express 11 onto a new Flash Card, *a precious inheritance from a CSCS X4540*, and booted the new =t3fs07=.
            * Stopped dCache on =t3fs11= to avoid additional load; because ZFS was already rebuilding using 2 spares, I let it run and, once it was done, promoted the 2 spares to be 2 new pool disks.

   * UNIBE (reports Gianfranco - please note: I will not attend the meeting): 
      * ce.lhep cluster (older CentOS 5) upgraded to ARC 3.0.2 
         * An obscure bug caused slapd not to start, with no trace of an error (unless you increase the log level to 256 (!) *and* instruct syslog to turn the infosys log on). The actual bug is here: [[https://bugzilla.nordugrid.org/show_bug.cgi?id=3226]]
         * The infoprovider does not work if stale job files (typically from previously failed jobs) exist in the controldir. After some long debugging, we understood the problem and Andrej provided a cleanup script.
      * ce.lhep cluster will grow ~4x in size with nodes from the HLT farm at CERN. At the same time (Sept), it will be re-installed with SLC6.
      * ce01.lhep cluster (newer SLC6): the problems reported previously are not permanently solved yet (issue with Thumper lockups, CVMFS cache becoming full). Still ready to move Lustre to IB, but the transition will require both me and Andrej to be around for some days at least (which has not happened since last month's meeting).
   * UNIGE (reports Szymon): 
      * First batch worker node running SLC6 put in operation. Most jobs are OK.
      * Hardware procurement for the 2013 upgrade is under way. 
         * Replace the Solaris machines in DPM (six machines, 96 TB net) with new hardware (IBM x3630 M4, 4 machines, 172 TB total).
      * !CernVM file system was set up. We use NFS to deploy it.
      * Adaptation of operational procedures, especially the cleanup of "dark data", to the new version of the ATLAS Distributed Data Management software "Rucio".
      * Hardware failures: 
         * A few disk failures on IBM x3630 M3 and Sun X4500+ machines
         * Memory errors on one Sun X4540 disk server
         * Failure of hardware raid on one IBM x3630, due to overheating
      * Automatic cleanup of /tmp is affecting very long jobs. Files are removed after 5 days. We still don't understand why; it is not tmpwatch.
   * UZH (reports Sergio): 
      * Xxx
   * Switch (reports Alessandro): 
      * Xxx
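
For the unexplained 5-day /tmp cleanup reported by UNIGE above, one possible diagnostic is to record which files are close to the threshold and then compare the list with what actually disappears; that would hint at whether the unknown cleaner keys on access, modification or change time. The sketch below is only an illustration: the function name =near_threshold=, the one-day margin and the choice of timestamps are assumptions, not part of the UNIGE setup.

```python
import os
import stat
import time

DAY = 86400.0  # seconds per day


def near_threshold(root, threshold_days=5.0, margin_days=1.0, now=None):
    """Return (path, ages) for regular files under `root` where at least one
    of atime, mtime or ctime is older than threshold_days - margin_days.
    `ages` maps each timestamp name to its age in days."""
    now = time.time() if now is None else now
    limit = threshold_days - margin_days
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                st = os.lstat(full)
            except OSError:
                continue  # file removed while scanning
            if not stat.S_ISREG(st.st_mode):
                continue
            ages = {
                "atime": (now - st.st_atime) / DAY,
                "mtime": (now - st.st_mtime) / DAY,
                "ctime": (now - st.st_ctime) / DAY,
            }
            if max(ages.values()) >= limit:
                found.append((full, ages))
    return found
```

Running this daily on a worker node and logging the result would give a before/after picture for each removed file.
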
Other topics
   * Topic1
   * Topic2
Next meeting date:

AOB

---++ Attendants
   * CSCS:
   * CMS: Daniel
   * ATLAS:
   * LHCb:
   * EGI:

---++ Action items
   * Item1