
Swiss Grid Operations Meeting on 2013-10-31

Agenda

Status

  • CSCS (reports Miguel):
    • Report on the situation of the scratch filesystem: after this week's events, the system seems to be stable.
      1. We have cleaned more than 50 million inodes and deployed new GPFS policies to prevent this from happening again.
        # Inode usage after the cleanup (df -i):
        Filesystem             Inodes     IUsed     IFree IUse% Mounted on
        /dev/gpfs           150626304  51891413  98734891   35% /gpfs

        # GPFS disk report (mmdf-style columns: disk size in KB, failure group,
        # holds metadata, holds data, free KB in full blocks, free KB in fragments):
        virident1           293937152   1005   Yes   No    25172992 (  9%)    40936480 (14%)
        virident2           293937152   1006   Yes   No    25139200 (  9%)    40991648 (14%)
        ssd1                390711360   1007   Yes   No   335058944 ( 86%)     6415328 ( 2%)
        ssd2                390711360   1008   Yes   No   335084544 ( 86%)     6396320 ( 2%)
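        As an illustration of the kind of policy now in place, here is a hedged sketch of a GPFS ILM cleanup rule and how it would be applied; the path and the 30-day threshold are made-up examples, not the rule actually deployed:
        /* scratch_cleanup.policy -- example rule: purge scratch files
           not accessed for 30 days (illustrative path and threshold) */
        RULE 'scratch_cleanup' DELETE
          WHERE PATH_NAME LIKE '/gpfs/scratch/%'
            AND (CURRENT_TIMESTAMP - ACCESS_TIME) > INTERVAL '30' DAYS

        # Dry-run first, then apply for real:
        mmapplypolicy /gpfs -P scratch_cleanup.policy -I test
        mmapplypolicy /gpfs -P scratch_cleanup.policy -I yes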
      2. New cp_1.sh (prolog) and cp_3.sh (epilog) scripts have also been deployed to make jobs run under /tmpdir_slurm/$CREAMCENAME/$JOBID instead of under $HOME.
        #!/bin/bash
        # cp_1.sh (prolog): this is _sourced_ by the CREAM job wrapper.
        # Create a per-job scratch directory on GPFS and run the job from
        # there instead of from $HOME.
        export TMPDIR="/gpfs/tmpdir_slurm/${SLURM_SUBMIT_HOST}/${SLURM_JOB_ID}"
        export MYJOBDIR="${TMPDIR}"
        mkdir -p "${TMPDIR}"
        cd "${TMPDIR}"

        #!/bin/bash
        # cp_3.sh (epilog): this is _sourced_ by the CREAM job wrapper.
        # This gets executed at the end: remove the per-job directory
        # (rmdir only succeeds if the job has cleaned up its files).
        rmdir "${MYJOBDIR}"
      3. This scratch filesystem runs on very old hardware (~4 years) and needs to be decommissioned ASAP. We are working on finding a good solution in terms of performance/price.
    • SLURM migration status:
      1. EMI-3 CREAM and ARC CEs work fine and have been accepting/running jobs for a while. No major issues found (except for some mismatches with the information system).
      2. UMD-2 CREAM and ARC CEs are in downtime and will be migrated to EMI-3 next week.
      3. WNs: 9 nodes remain on Torque/pbs (UMD-2). By end of next week all will be migrated to SLURM (EMI-3).
      4. Accounting (APEL): the old UMD-2 APEL node has been shut down and the new one is fetching data from the EMI-3 CREAM CEs. Still waiting for the APEL team (John Gordon) to give us the green light to publish to the new APEL system. SLURM accounting for October may be lost because the official APEL migration process is not clear (not working!). ARC will publish directly, without passing through the APEL server at CSCS.
      5. BDII: still running UMD-2. Needs to be upgraded to EMI-3 to solve issues with GLUE2 publication. Will be done ASAP.
    • CVMFS upgraded to 2.1.15
    • dCache migration:
      • Testing 2.6 in pre-production. So far it seems to be a fairly simple migration with minimal changes.
      • We will also use this opportunity to upgrade to PostgreSQL 9.3 (sketch below).
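        For reference, a minimal sketch of the major-version step using pg_upgrade, assuming the current cluster is a 9.2 installation from the PGDG packages; the service names, data paths and versions are illustrative, not our actual layout:
        # Stop dCache and the old PostgreSQL instance first
        # (the new 9.3 cluster must already be initialised with initdb).
        dcache stop
        service postgresql-9.2 stop

        # Run the in-place migration as the postgres user.
        su - postgres -c 'pg_upgrade \
            -b /usr/pgsql-9.2/bin      -B /usr/pgsql-9.3/bin \
            -d /var/lib/pgsql/9.2/data -D /var/lib/pgsql/9.3/data'

        service postgresql-9.3 start
        dcache start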
    • Working on new monitoring system: http://ganglia.lcg.cscs.ch/ganglia3/ (views for GPFS/dCache IO stats)
  • PSI (reports Fabio):
    • Collected HW offers to request funds for the next 2 years. Basically we will need:
      • more WNs
      • a new NetApp
      • a couple of 2U Oracle NAS boxes to relocate the current replicated and shared /shome onto new HW (today we use 2 * Sun Thumpers)
        • Alternatively, we might build /shome from 10 * 2 TB disks hosted inside the future new NetApp, SAS-connected and MPxIO-managed by an active/passive pair of 1U Oracle servers; the active node would format the 10 * 2 TB disks as a ZFS pool and serve the resulting filesystem over NFSv3; if the active node crashes, we simply import the ZFS pool on the passive node and switch NFSv3 back on. This configuration is cheaper than a replicated pair of 2U NAS boxes (fewer disks, cheaper nodes).
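          A rough sketch of the failover under this option; the pool/dataset name is illustrative and the actual layout is still being evaluated:
          # Both nodes see the same SAS disks via MPxIO; only one imports the pool.
          # On the old active node (if it is still reachable):
          zpool export shome

          # On the passive node, which now becomes active:
          zpool import shome
          # make sure the NFSv3 export is enabled (the property persists with the pool)
          zfs set sharenfs=on shome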
      • maybe a Cisco N7K extender with 32 * 10 Gbit/s ports; not urgent, but at some point 10 Gbit/s Ethernet will become the default.
    • Updated the CMS Frontier Squid to the latest version
    • Scheduled a downtime on 8th Nov to upgrade to dCache 2.6 and to move our SL5 UIs to UMD-3; not sure about the SL5 WNs because the new WN tarball is not yet available
    • dCache 2.6 upgrade
      • This may also be useful for CSCS if confirmed: with dCache 2.6 you can avoid having a separate gPlazma1 configuration for the Xrootd door and simply use the common gPlazma2 cell. I wrote up my configuration here and asked for confirmation.
      • Since spring we have been trying to partition the T3 users into primary groups representing the disjoint T3 subgroups: this allows us to easily compute the /pnfs group space usage with a simple query against the Chimera DB; from such a partitioning one can also produce Ganglia plots and the like (example query below). I hope to introduce this change on 8th Nov. In the future CSCS could consider adopting the same partitioning, being aware that it raises both complexity and security implications.
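        As an illustration, a per-group usage query of the kind meant here; this is a sketch only: the database name, the psql user and the Chimera columns t_inodes.igid / t_inodes.isize are assumptions and should be checked against the deployed schema:
        # Sum of file sizes per primary GID in the Chimera namespace (assumed schema).
        psql -U postgres chimera -c "
          SELECT igid, pg_size_pretty(SUM(isize)::bigint) AS used
            FROM t_inodes
           GROUP BY igid
           ORDER BY SUM(isize) DESC;"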
      • The Xrootd dCache pool plugin was incompatible with dCache 2.6.11; I asked for a compatible build to be produced
  • UNIBE (reports Gianfranco):
    • ce01.lhep cluster
      • Still stable after moving Lustre to the InfiniBand layer. There was still one occurrence of the Thumper NIC/PCI lockup a couple of weeks back. Lustre is now under heavier load and working well so far. Will try to commission it for analysis soon.
      • The latest version of CVMFS is said to cure the cache-full issue we suffer from. Will upgrade soon.
      • The new NFSv4 user-mapping defaults broke file ownership on an NFS share on the ce01.lhep cluster. This prevented the SW validation jobs from writing the SW tags to the shared area. The fix consists in explicitly declaring the Domain in /etc/idmapd.conf (snippet below).
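        For reference, the relevant snippet; the domain value is an example and must be identical on the NFS server and all clients:
        # /etc/idmapd.conf
        [General]
        # example value; use the site's NFSv4 domain
        Domain = lhep.unibe.ch

        # then restart the id-mapping daemon (SL5/SL6):
        service rpcidmapd restart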
    • c202.lhep cluster
      • All ROCKS images (MDS, OSS, WN) ready, mass installation under way
      • ARC not yet installed. We hope to run test jobs tomorrow
  • UNIGE (reports Szymon):
    • Xxx
  • UZH (reports Sergio):
    • Xxx
  • Switch (reports Alessandro):
    • Xxx
Other topics
  • Topic1
  • Topic2
Next meeting date:

AOB

Attendants

  • CSCS: Miguel Gila, George Brown
  • CMS: Fabio, Daniel
  • ATLAS:
  • LHCb: Roland Bernet
  • EGI:

Action items

  • Item1