Swiss Grid Operations Meeting on 2013-08-08

Agenda

Status

  • CSCS (reports George):
    • Storage01 root partition became full on 29 July, resulting in failed transfers. A full reboot of the machine was required to return it to service.
      • Restarting the dCache services brought the used space reported by the operating system back to the normal ~20%.
      • The next day we noticed that disk usage was still growing, caused by dCache creating trace files under /tmp.
        • Altering the log levels via the dCache CLI/pcells had no effect.
        • For the time being we are log-rotating these files.
        • The relevant line in logback.xml has been changed, so these files will no longer be created once dCache is next restarted.
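For reference, the logback.xml change amounts to silencing the logger that feeds the /tmp trace appender. The sketch below is illustrative only; the actual logger name in dCache's logback.xml differs:

```xml
<!-- Illustrative only: the real logger name in dCache's logback.xml is
     different. Setting the level to "off" stops the trace files from
     being written once dCache is restarted. -->
<logger name="some.trace.logger" level="off"/>
```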
    • We have just provisioned slurm1, slurm2 and cream04 to begin testing Slurm/EMI3.
    • Achieved 99% availability and 100% reliability in the Tier-2 report for July: http://sam-reports.web.cern.ch/sam-reports/2013/201307/wlcg/WLCG_Tier2_OPS_Jul2013.pdf

  • PSI (reports Fabio):
    • Xxx
  • UNIBE (reports Gianfranco - please note: I will not attend the meeting):
    • ce.lhep cluster (older CentOS 5) upgraded to ARC 3.0.2
      • An obscure bug caused slapd not to start, with no trace of an error (unless the log level is increased to 256 (!) and syslog is instructed to turn the infosys log on). The actual bug is here: https://bugzilla.nordugrid.org/show_bug.cgi?id=3226
      • The infoprovider does not work if stale job files (typically from previously failed jobs) exist in the controldir. After some long debugging we understood the problem, and Andrej provided a cleanup script.
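For the record, the two debugging steps mentioned above look roughly like this in standard OpenLDAP/syslog terms (exact file locations vary per installation):

```conf
# slapd.conf: raise the log level so startup problems leave a trace
# (256 is the "stats" level: connections, operations and results;
# see slapd.conf(5))
loglevel 256

# rsyslog/syslog.conf: slapd logs to the LOCAL4 facility by default,
# so direct it to a file to actually capture the output
local4.*    /var/log/slapd.log
```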
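A minimal sketch of such a controldir cleanup, not Andrej's actual script: the path, the `job.*` file pattern and the 7-day threshold are all illustrative assumptions; check arc.conf (`controldir=`) before running anything like this.

```shell
#!/bin/sh
# Remove stale job state files (typically left by failed jobs) from the
# A-REX controldir so the infoprovider can run again. All names below
# are guesses, not the production values.
CONTROLDIR=${CONTROLDIR:-/var/spool/arc/jobstatus}
if [ -d "$CONTROLDIR" ]; then
    # delete job files untouched for a week
    find "$CONTROLDIR" -maxdepth 1 -type f -name 'job.*' -mtime +7 -delete
fi
```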
    • The ce.lhep cluster will grow ~4x in size with nodes from the HLT farm at CERN. At the same time (September) it will be reinstalled with SLC6.
    • ce01.lhep cluster (newer SLC6): the problems reported previously are not permanently solved yet (issues with Thumper lockups and the CVMFS cache becoming full). We are still ready to move Lustre to InfiniBand, but the transition will require both me and Andrej to be around for at least a few days (which has not happened since last month's meeting).
  • UNIGE (reports Szymon):
    • First batch worker node running SLC6 put in operation. Most jobs are OK.
    • Hardware procurement for the 2013 upgrade is under way.
      • Replace Solaris machines in DPM (six machines, 96 TB net) with new hardware (IBM x3630 M4, four machines, 172 TB total).
    • The CernVM File System (CVMFS) was set up. We use NFS to deploy it.
    • Adaptation of operational procedures, especially the cleanup of "dark data", to the new version of the ATLAS Distributed Data Management software "Rucio".
    • Hardware failures:
      • A few disk failures on IBM x3630 M3 and Sun X4500+ machines
      • Memory errors on one Sun X4540 disk server
      • Failure of the hardware RAID on one IBM x3630, due to overheating
    • Automatic cleanup of /tmp is affecting very long jobs: files are removed after 5 days. We still don't understand why; it is not tmpwatch.
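Since the cause is still unknown, one hedged workaround sketch: a long job could periodically bump the timestamps of its working files so any age-based cleaner sees them as recent. The directory variable and cadence below are illustrative assumptions, not a tested fix.

```shell
#!/bin/sh
# Keep-alive sketch for long jobs whose /tmp files get removed after
# 5 days. Everything here (names, interval) is illustrative.
refresh_scratch() {
    # -c: never create missing files; just refresh timestamps under $1
    find "$1" -xdev -exec touch -c {} +
}
# a job wrapper might run, alongside the payload:
#   while kill -0 "$payload_pid" 2>/dev/null; do
#       refresh_scratch "$TMPDIR"
#       sleep 14400   # every 4 hours, well inside the 5-day window
#   done
```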
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
Other topics
  • Topic1
  • Topic2
Next meeting date:

AOB

Attendants

  • CSCS:
  • CMS: Daniel
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1
Topic revision: r5 - 2013-08-08 - SzymonGadomski