
Swiss Grid Operations Meeting on 2013-08-08

Agenda

Status

  • CSCS (reports George):
    • Storage01's root partition became full on 29 July, resulting in failed transfers. A full reboot of the machine was required to bring it back into service.
      • Restarting the dCache services brought the used space reported by the operating system back to the normal ~20%.
      • The next day we noticed disk usage was still growing, caused by dCache creating trace files under /tmp:
        • Altering the log levels via the dCache CLI/pcells didn't have any effect.
        • For the time being we are rotating these files with logrotate.
        • The relevant line in logback.xml has been changed, so these files will no longer be created when dCache is next restarted (see the sketch after this list).
    • We have just provisioned slurm1, slurm2 and cream04 to begin testing Slurm/EMI3.
    • Achieved 99% availability and 100% reliability in the Tier-2 report for July: http://sam-reports.web.cern.ch/sam-reports/2013/201307/wlcg/WLCG_Tier2_OPS_Jul2013.pdf
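    A minimal sketch of the kind of logback.xml change involved, assuming the trace output comes from a dedicated logger (the logger name below is illustrative, not the actual dCache one):

      <!-- Hypothetical example: silence the logger that writes trace files under /tmp. -->
      <!-- The real logger name in dCache's logback.xml will differ. -->
      <logger name="org.dcache.trace" level="off"/>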

  • PSI (reports Fabio):
    • Good news:
      • I've physically installed our new NetApp E5460 (360 TB raw); if you have never seen a NetApp E5460, watch this YouTube video.
      • The 12 RAID6 arrays (creation took ~4 days), volume groups, volumes and FC host definitions were created automatically by simply loading the configuration of our other E5460 (I saved a lot of time!).
      • I'm preparing an SL6 + RDAC + FC installation to stress-test the E5460 before merging it into our production environment.
      • NetApp support was unresponsive and really remote; it is based in India, so answers arrive in their timezone. It took days to get a non-guest account, download SANtricity, and activate/map the E5460 serial number to my ID.
      • The NetApp AutoSupport service, which lets the system be remotely monitored by NetApp, looks nice, but I did not have time to try it. Update: the NetApp E-Series does NOT have AutoSupport; bad luck.
    • Bad news: our T3 has been partially down since 5 Aug! Returning from a Sunday in Italy, the failures began:
      • First the Milano -> Zurich train: I arrived home at 3am!
      • On 5 Aug, the Solaris dCache server t3fs07 (an X4540) froze. I rebooted it, but Solaris 10 could not boot: the flash card had failed, and there were 3 broken SATA disks in the server, 2 of them in the same raidz2 and 1 producing an endless "Disconnected command timeout for Target 1".
      • In the meantime, 2 disks were failing in another X4540 server, t3fs11, again in the same raidz2.
      • To complete the series of crashes, another disk failed in a third X4540 server, t3fs10, but that one was easy to manage.
      • To fix all this, I:
      • Installed Solaris Express 11 on a new flash card (a precious inheritance from a CSCS X4540) and booted the new t3fs07; after a couple of reboots this flash card failed too, and I had to reinstall Solaris Express 11 on our last flash card.
      • Stopped dCache on t3fs11 to avoid additional load. Since ZFS was already rebuilding onto 2 spares, I let it complete and then promoted the 2 spares to permanent pool disks, turning the 2 previous pool disks into the new spares (see the sketch after this list). Generally speaking this avoids extra rebuilds, which helps preserve the disks, but it permutes the raidz2 configurations across our 5 X4540s, raising the cluster's complexity.
      • On 9 Aug another 2 disks broke, 1 in t3fs11 and 1 in t3fs08.
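    A minimal sketch of the spare-promotion step in ZFS, with hypothetical pool and device names (once a hot spare has finished resilvering, detaching the original disk makes the spare a permanent member of the raidz2 vdev):

      # Device names are illustrative; the spare c1t40d0 has already
      # resilvered in place of the original disk c1t5d0.
      zpool detach tank c1t5d0        # spare is promoted to a permanent pool disk
      zpool add tank spare c1t5d0     # re-register the old disk as a hot spare (assuming it is still usable)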

  • UNIBE (reports Gianfranco - please note: I will not attend the meeting):
    • ce.lhep cluster (older CentOS 5) upgraded to ARC 3.0.2
      • An obscure bug caused slapd not to start with no trace of an error, unless you increase the log level to 256 (!) and instruct syslog to capture the infosys log (see the sketch after this list). The actual bug is here: https://bugzilla.nordugrid.org/show_bug.cgi?id=3226
      • The infoprovider does not work if stale job files (typically from previously failed jobs) exist in the controldir. After some long debugging we understood the problem, and Andrej provided a cleanup script.
    • The ce.lhep cluster will grow ~4x in size with nodes from the HLT farm at CERN. At the same time (Sept), it will be reinstalled with SLC6.
    • ce01.lhep cluster (newer SLC6): the problems reported previously are not permanently solved yet (Thumper lockups, CVMFS cache filling up). We are still ready to move Lustre to IB, but the transition will require both me and Andrej to be around for at least a few days (which has not happened since last month's meeting).
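    For reference, a minimal sketch of the debugging knobs mentioned above, assuming direct access to the slapd and syslog configuration (under ARC the slapd setup is generated, so the equivalent settings live in the infosys/BDII configuration):

      # slapd.conf: log level 256 ("stats") logs connections, operations and results
      loglevel 256

      # rsyslog/syslog: slapd logs to the local4 facility by default
      local4.*    /var/log/slapd.log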
  • UNIGE (reports Szymon):
    • First batch worker node running SLC6 put in operation. Most jobs are OK.
    • Hardware procurement for the 2013 upgrade is under way.
      • Replace the Solaris machines in DPM (six machines, 96 TB net) with new hardware (IBM x3630 M4, 4 machines, 172 TB total).
    • The CernVM File System (CVMFS) was set up. We use NFS to deploy it (see the sketch after this list).
    • Adaptation of operational procedures, especially the cleanup of "dark data", to the new version of the ATLAS Distributed Data Management software "Rucio".
    • Hardware failures:
      • A few disk failures on IBM x3630 M3 and Sun X4500+ machines
      • Memory errors on one Sun X4540 disk server
      • Failure of hardware raid on one IBM x3630, due to overheating
    • Automatic cleanup of /tmp is affecting very long jobs: files are removed after 5 days. We still don't understand why; it is not tmpwatch.
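    A minimal sketch of an NFS-based CVMFS deployment of this kind, assuming one node mounts CVMFS in NFS-export mode and the worker nodes mount it read-only over NFS (hostnames and paths are illustrative):

      # On the NFS server, in /etc/cvmfs/default.local:
      CVMFS_NFS_SOURCE=yes          # enable CVMFS's NFS-export mode

      # On a worker node, an /etc/fstab entry (server name illustrative):
      nfsserver:/cvmfs/atlas.cern.ch  /cvmfs/atlas.cern.ch  nfs  ro,noatime  0 0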
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
Other topics
  • Any news about the Infiniband bridges?
  • CMS
    • Jobs at CSCS run fine on the new SLC6 nodes
    • Some problems with file transfers to other T2s; most likely caused by a configuration problem at KIT.
    • Not yet able to retire the old cmsvobox; the PhEDEx software does not yet work under SLC6 (Perl incompatibilities)
    • We still see some warnings in the CMS monitoring due to a synchronization issue between putting new software on CVMFS and announcing that it is available (a central problem). In general, if you see an error in the CMS monitoring for CSCS, it is always a good idea to check whether >10 other sites have the same "problem".

  • Topic2
Next meeting date: concurrently with the next Face 2 Face Meeting in Zurich.

AOB

Attendants

  • CSCS: Miguel, George
  • CMS: Daniel, Fabio
  • ATLAS: Szymon
  • LHCb: none
  • EGI: Alessandro

Action items

  • Item1