
Swiss Grid Operations Meeting on 2014-01-17

Agenda

Status

  • CSCS (reports Miguel):
    • Everything on EMI-3 CREAM / ARC / BDII
    • Now running dCache 2.6 with PostgreSQL 9.3
    • /pnfs changes
      • Read/write access now permitted
      • LHCb had issues with lcg-getturls using file as the first preferred protocol; in short jobs, though, /pnfs was available locally on the WN. Resolved with the dCache config update "loginBroker=srm-LoginBroker" under the NFS domain.
    • The Admin cell in dCache had issues earlier this week, causing metrics not to be updated. Still looking into the root cause; the log shows failure-to-close-socket errors.
    • Began a trial of node wn68 without /experiment_software/ mounts; so far so good.
  • PSI (reports Fabio):
    • Could not upgrade our SL5 WNs from the UMD2 tarball to the UMD3 tarball; the maintainer told us that he plans to release the new tarball during January. Let's see.
    • Migrated from Subversion to Git (Puppet recipes, configuration files, site scripts, ...).
    • Upgraded Nagios from ver. 3.1 to the latest 4.0.2; much faster.
    • Upgraded Puppet from ver. 0.25 to 2.7.20, on both Linux and Solaris.
    • Upgraded PostgreSQL from ver. 9.2 to 9.3; I especially upgraded it to use the PG 9.3 materialized views with dCache. To upgrade you have to dump the PG 9.2 DBs and ingest them into PG 9.3.
      • I materialized the view v_pnfs in dCache; it is now much faster to extract metadata statistics about /pnfs. Be aware that you have to refresh the materialized views regularly (see the sketch after the status list); my code is on Bitbucket.
    • Upgraded dCache to the latest ver. 2.6.19
      • Regrettably, the access time of a /pnfs file still doesn't get updated.
      • The Linux pools could not start with the RPM dcache26-plugin-xrootd-monitor-5.0.0-2, so I removed it; we hit the same issue during our previous dCache upgrade.
      • Validated once more this CMS dCache/Xrootd configuration; since 13 Jan PSI has been offering the xrootd service based on it. I invite CSCS to consider that configuration.
      • Since the 2.6.19 upgrade the dCache pool daemon constantly keeps more than 4k files open; we got more than one "diskCacheV111.util.DiskErrorCacheException: File could not be opened [/mnt/data10/t3fs13_cms/pool/data/00007E804B4380294466B052E0D6BDB0F454 (Too many open files)]; please check the file system". We increased the nofile ulimit to 6000 for the dcache user; I've written a Nagios Python check to monitor whether 6000 is enough (a sketch follows the status list), and so far it seems to be.
  • UNIBE (reports Gianfranco):
    • ce02 cluster with ~800 cores fully commissioned shortly after our last meeting, with ARC 3.0.3-1.el6.x86_64 (same as ce01)
    • Lustre has stabilised; the cvmfs upgrade to 2.1.15-1 has finally removed the long-standing issue of the cache overflowing the partition
    • One more Lustre OSS (Thumper) crashed and is out of service (2 dead nodes in the 6 months since commissioning IB; it used to be 2 a week prior to that). The node seems to have spontaneously rebooted; it is stuck at the SuperMicro splash screen and does not respond to the keyboard. Lustre recovered by disabling the node as a Lustre server everywhere (jobs using files on that node hang and eventually fail). Added the last spare Thumper we had to Lustre. No spares left now, so a new failure will mean a total re-install of everything (due to the way ROCKS handles RPM versioning, always installing the latest version)
    • More than 4 full weeks of stable, almost unattended operations on both the ce01 and ce02 clusters
    • Both clusters commissioned for analysis and Athena multi-core. Also running the so-called 'hi-prio' high pile-up reconstruction tasks normally handled by T1s (in the non-ARC world)
    • VO t2k.org fully commissioned on both clusters and the DPM SE (quite active)
    • DPM SE re-configured for WebDAV (ATLAS requirement in order to switch from dq2 to the Rucio DDM tools). Also started configuring xrootd in order to join the DE xrootd federation (FAX) along with CSCS and UniGE. This required upgrading the middleware to the latest versions available in EMI-3 and EPEL on the head node and all the disk servers (1.8.7-3). The upgrade was laborious but painless.
    • NOTE about DPM: YAIM maintained until August 2014; phased out by ~end of year in favour of Puppet
    • Migration to Rucio name convention performed by central ATLAS DDM in December
    • Getting quotes for new purchases: Lustre storage and WNs
    • ARC Accounting:
      • Working on commissioning the Jura-to-APEL publisher. In principle ready to send job records to an APEL test server. Meanwhile still publishing to the Swiss SGAS instance at SWITCH, which will be decommissioned soon
      • Discovered a permission problem for ce02 on the SGAS server that had prevented pushing job records to it. Fixed; currently pushing >100k records, which will then be pushed to APEL in backdated mode
    • Migration of other NGI_CH services from SWITCH to Unibe ongoing
      • Installed and configured a VOMS server and a GIIS server
      • VOMS under testing; GIIS configuration to be finalised
      • Planning a smooth transition over the next few weeks
  • UNIGE (reports Szymon):
    • Upgrade of the SE
      • 4 new disk servers (IBM x3630 M4, 43 TB for data) running SLC6 and added to the DPM
      • 6 old Solaris disk servers drained from data and retired (2 will be reused for NFS)
      • software upgraded to 1.8.7 on all machines (15 disk servers and the head node)
      • WebDAV and xrootd interfaces added
    • Reorganization of data in the SE for the new ATLAS DDM 'Rucio'
      • renaming process run by Cedric Serfon for DDM ops using WebDAV
      • two failed attempts in Dec 2013 with 'Too many connections' errors
      • success in Jan 2014, while local jobs were not running
      • issues with /var filling up for an unknown cause during the renaming (df says full, du sees 50%); must be a DPM "feature"
    • Preparing a funding request
      • new disk servers to run NFS user home and software space (Sun X4-2)
      • new machines to be hosts for critical VMs (IBM 3550 M4)
      • one data storage beast (IBM 3630 M4)
    • Yearly review of accounts
    • One hardware problem in Dec: RAID overheating in an IBM x3630 M3
    • Two scheduled down times for work on electricity in the machine room
    • Change of nearly all IP numbers
    • Pending: SLC6 migration and adding the ex ATLAS Trigger machines
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
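
Sketch referenced from the PSI PostgreSQL item above: a minimal example of how a PG 9.3 materialized view such as v_pnfs could be refreshed periodically (e.g. from cron), since PostgreSQL does not refresh it automatically. The connection parameters (database name "chimera", user, host) are illustrative assumptions, not PSI's actual setup; the real recipe is in the Bitbucket code mentioned above.

    #!/usr/bin/env python
    # Illustrative sketch only: the DSN values are assumptions, not PSI's real configuration.
    import psycopg2

    DSN = "dbname=chimera user=dcache host=localhost"  # hypothetical connection string

    def refresh_v_pnfs():
        # PG 9.3 materialized views are static snapshots; REFRESH must be run
        # regularly (e.g. via cron) so the /pnfs metadata statistics stay current.
        conn = psycopg2.connect(DSN)
        try:
            cur = conn.cursor()
            cur.execute("REFRESH MATERIALIZED VIEW v_pnfs;")
            conn.commit()
            cur.close()
        finally:
            conn.close()

    if __name__ == "__main__":
        refresh_v_pnfs()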
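
Sketch referenced from the PSI dCache open-files item above: a minimal, hypothetical Nagios-style check that counts the pool daemon's open file descriptors via /proc and compares them against warning/critical thresholds below the 6000 nofile limit. The pid-file path and the thresholds are assumptions; the actual check written at PSI may differ.

    #!/usr/bin/env python
    # Hypothetical sketch of a Nagios plugin; pid-file path and thresholds are assumptions.
    import os
    import sys

    PIDFILE = "/var/run/dcache/poolDomain.pid"   # hypothetical pid file of the pool domain
    WARN, CRIT = 4500, 5800                      # thresholds under the 6000 nofile ulimit

    def open_fds(pid):
        # Count the entries in /proc/<pid>/fd (needs enough privileges to read it).
        return len(os.listdir("/proc/%d/fd" % pid))

    def main():
        try:
            with open(PIDFILE) as f:
                pid = int(f.read().strip())
            nfds = open_fds(pid)
        except (IOError, OSError, ValueError) as exc:
            print("UNKNOWN - cannot determine open-file count: %s" % exc)
            return 3
        label, status = "OK", 0
        if nfds >= CRIT:
            label, status = "CRITICAL", 2
        elif nfds >= WARN:
            label, status = "WARNING", 1
        # Standard Nagios output: status text plus performance data.
        print("%s - dCache pool daemon has %d open files | open_files=%d;%d;%d"
              % (label, nfds, nfds, WARN, CRIT))
        return status

    if __name__ == "__main__":
        sys.exit(main())
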
Other topics

Next meeting date:

AOB

Attendants

  • CSCS:
  • CMS: Fabio, Daniel, Derek
  • ATLAS:
  • LHCb: Roland
  • EGI:

Action items

  • Item1