Swiss Grid Operations Meeting on 2015-11-10

Site status

CSCS

Systems:
  • HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found the solution ourselves: disable the Smart Array and enable legacy mode for the boot disk.
  • Extended the IB bridges warranty until spring 2016
  • Requested new certificates for the argus* hosts with the correct DNS AltName
  • LHCb jobs are still not running well; we suggested to Vladimir that he use the right runtime environments (env/proxy and glite), but there has been no change so far.
  • CMS is testing multicore jobs
  • Working hard to finalize the arc02 Puppet configuration.
  • We are planning to decommission cream04
  • Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
  • Getting offers for the Phoenix expansion
Storage:
  • Scratch - GPFS: NetApp storage firmware upgrade (no service interruption).
  • dCache:
    • We still have the cleaner problem, mainly with CMS. At the moment the cleaner has to be run manually, but the situation has stabilized after some big deletions by CMS.
    • This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

PSI

  • FYI: 1 TB OwnCloud/EOS @ CERN: http://cernbox.web.cern.ch/
  • NFSv4
    • Context : MeetingSwissGridOperations20151015#PSI
    • In the end I made a RAID10 (12 ZFS mirror pairs) with 24 disks and no spare:
      Mon Nov 9 17:44:37 CET 2015
                                             capacity     operations    bandwidth
      pool                                alloc   free   read  write   read  write
      ----------------------------------  -----  -----  -----  -----  -----  -----
      data01                              4.51T  2.00T  7.43K  17.6K   590M  1.21G
        mirror                             385G   171G    734  1.43K  50.5M   100M
          pci-0000:08:00.0-scsi-0:2:0:0       -      -    331    978  23.8M   100M
          pci-0000:08:00.0-scsi-0:2:1:0       -      -    337    980  27.1M   101M
        mirror                             385G   171G    660  1.48K  49.9M  99.5M
          pci-0000:08:00.0-scsi-0:2:2:0       -      -    312    888  26.1M  99.8M
          pci-0000:08:00.0-scsi-0:2:3:0       -      -    293    910  24.0M   103M
        mirror                             385G   171G    587  1.52K  48.9M   106M
          pci-0000:08:00.0-scsi-0:2:4:0       -      -    251    940  22.8M   106M
          pci-0000:08:00.0-scsi-0:2:5:0       -      -    290    939  26.3M   106M
        mirror                             385G   171G    608  1.46K  49.2M   106M
          pci-0000:08:00.0-scsi-0:2:6:0       -      -    274    949  24.8M   108M
          pci-0000:08:00.0-scsi-0:2:7:0       -      -    284    933  24.7M   106M
        mirror                             385G   171G    583  1.35K  49.0M   103M
          pci-0000:08:00.0-scsi-0:2:8:0       -      -    264    917  24.4M   104M
          pci-0000:08:00.0-scsi-0:2:9:0       -      -    267    908  24.8M   103M
        mirror                             385G   171G    566  1.49K  46.5M   107M
          pci-0000:08:00.0-scsi-0:2:10:0      -      -    255    952  23.5M   108M
          pci-0000:08:00.0-scsi-0:2:11:0      -      -    260    944  23.2M   107M
        mirror                             385G   171G    607  1.56K  49.9M   106M
          pci-0000:08:00.0-scsi-0:2:12:0      -      -    277    942  24.1M   106M
          pci-0000:08:00.0-scsi-0:2:13:0      -      -    265    953  25.9M   108M
        mirror                             385G   171G    638  1.43K  49.0M   102M
          pci-0000:08:00.0-scsi-0:2:14:0      -      -    281    960  23.3M   102M
          pci-0000:08:00.0-scsi-0:2:15:0      -      -    288    978  26.0M   105M
        mirror                             385G   171G    622  1.48K  48.9M   105M
          pci-0000:08:00.0-scsi-0:2:16:0      -      -    277   1003  24.1M   109M
          pci-0000:08:00.0-scsi-0:2:17:0      -      -    296    975  25.1M   105M
        mirror                             385G   171G    672  1.48K  50.6M   102M
          pci-0000:08:00.0-scsi-0:2:18:0      -      -    296    926  24.2M   105M
          pci-0000:08:00.0-scsi-0:2:19:0      -      -    324    911  26.7M   103M
        mirror                             385G   171G    670  1.50K  49.7M   103M
          pci-0000:08:00.0-scsi-0:2:20:0      -      -    302    935  24.6M   106M
          pci-0000:08:00.0-scsi-0:2:21:0      -      -    309    915  25.4M   103M
        mirror                             385G   171G    658  1.39K  47.6M   105M
          pci-0000:08:00.0-scsi-0:2:22:0      -      -    305    980  24.0M   107M
          pci-0000:08:00.0-scsi-0:2:23:0      -      -    288    965  24.0M   105M
      ----------------------------------  -----  -----  -----  -----  -----  -----
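    • For illustration, a minimal sketch of how a 12-mirror-pair pool like this could be created (the device names sdb..sdy are placeholders, not the pci-* aliases actually used above):
      # Create a pool of 12 two-way mirrors (RAID10-style), no spare:
      zpool create data01 \
        mirror sdb sdc  mirror sdd sde  mirror sdf sdg \
        mirror sdh sdi  mirror sdj sdk  mirror sdl sdm \
        mirror sdn sdo  mirror sdp sdq  mirror sdr sds \
        mirror sdt sdu  mirror sdv sdw  mirror sdx sdy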
    • Instead of a single ZFS filesystem I made a hierarchy of filesystems, as advised by Oracle:
      # zfs list
      NAME                   USED  AVAIL  REFER  MOUNTPOINT
      data01                4.41T  1.90T    96K  /zfs/data01
      data01/shome          4.06T  1.90T   120K  /zfs/data01/shome
      data01/shome/amarini  42.0G  1.90T  42.0G  /zfs/data01/shome/amarini
      data01/shome/bbilin    120K  1.90T   120K  /zfs/data01/shome/bbilin
      data01/shome/bianchi  67.7G  1.90T  67.7G  /zfs/data01/shome/bianchi
      data01/shome/casal    77.1G  1.90T  77.1G  /zfs/data01/shome/casal
      ...
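    • A minimal sketch of how such a per-user hierarchy could be created (the usernames are taken from the listing above; the loop is illustrative):
      # One child filesystem per user under data01/shome:
      zfs create data01/shome
      for u in amarini bbilin bianchi casal; do
        zfs create data01/shome/$u
      done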
    • By setting properties on the root of the hierarchy, they get propagated to each child:
      # zfs get all data01/shome
      NAME          PROPERTY              VALUE                  SOURCE
      data01/shome  type                  filesystem             -
      data01/shome  creation              Sun Nov 8 18:06 2015   -
      data01/shome  used                  4.06T                  -
      data01/shome  available             1.90T                  -
      data01/shome  referenced            120K                   -
      data01/shome  compressratio         1.30x                  -
      data01/shome  mounted               yes                    -
      data01/shome  quota                 none                   default
      data01/shome  reservation           none                   default
      data01/shome  recordsize            128K                   default
      data01/shome  mountpoint            /zfs/data01/shome      inherited from data01
      data01/shome  sharenfs              rw=@192.33.123.0/24,sync,root_squash,no_all_squash,no_subtree_check,crossmnt,sec=sys  local
      data01/shome  checksum              on                     default
      data01/shome  compression           lz4                    inherited from data01
      data01/shome  atime                 on                     default
      data01/shome  devices               on                     default
      data01/shome  exec                  on                     default
      data01/shome  setuid                on                     default
      data01/shome  readonly              off                    default
      data01/shome  zoned                 off                    default
      data01/shome  snapdir               hidden                 inherited from data01
      data01/shome  aclinherit            restricted             default
      data01/shome  canmount              on                     default
      data01/shome  xattr                 on                     default
      data01/shome  copies                1                      default
      data01/shome  version               5                      -
      data01/shome  utf8only              off                    -
      data01/shome  normalization         none                   -
      data01/shome  casesensitivity       sensitive              -
      data01/shome  vscan                 off                    default
      data01/shome  nbmand                off                    default
      data01/shome  sharesmb              off                    default
      data01/shome  refquota              none                   default
      data01/shome  refreservation        none                   default
      data01/shome  primarycache          all                    default
      data01/shome  secondarycache        all                    default
      data01/shome  usedbysnapshots       0                      -
      data01/shome  usedbydataset         120K                   -
      data01/shome  usedbychildren        4.06T                  -
      data01/shome  usedbyrefreservation  0                      -
      data01/shome  logbias               latency                default
      data01/shome  dedup                 off                    default
      data01/shome  mlslabel              none                   default
      data01/shome  sync                  standard               inherited from data01
      data01/shome  refcompressratio      1.00x                  -
      data01/shome  written               120K                   -
      data01/shome  logicalused           5.24T                  -
      data01/shome  logicalreferenced     52K                    -
      data01/shome  filesystem_limit      none                   default
      data01/shome  snapshot_limit        none                   default
      data01/shome  filesystem_count      60                     local
      data01/shome  snapshot_count        0                      local
      data01/shome  snapdev               hidden                 default
      data01/shome  acltype               off                    default
      data01/shome  context               none                   default
      data01/shome  fscontext             none                   default
      data01/shome  defcontext            none                   default
      data01/shome  rootcontext           none                   default
      data01/shome  relatime              on                     temporary
      data01/shome  redundant_metadata    all                    default
      data01/shome  overlay               off                    default
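    • A sketch of setting those properties once at the top of the hierarchy (the values are copied from the listing above; children then report SOURCE "inherited from ..."):
      # Set once on the parent; children inherit automatically:
      zfs set compression=lz4 data01
      zfs set snapdir=hidden data01
      zfs set sharenfs='rw=@192.33.123.0/24,sync,root_squash,no_all_squash,no_subtree_check,crossmnt,sec=sys' data01/shome
      # Verify inheritance on the children:
      zfs get -r compression data01/shome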
    • Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each child, all at the same time.
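      For example (the snapshot name is illustrative):
      # One atomic, recursive snapshot of data01/shome and all of its children:
      zfs snapshot -r data01/shome@nightly-20151109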
    • Taking snapshots (but without granting the destroy permission) can be delegated to each user on their own filesystem, and can even be driven by simple NFSv4 mkdir commands! Oracle Ref; it needs a tweak on ZFS on Linux:
      The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. 
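      A sketch of the delegation (the username is taken from the listing above as an example; zfs allow and the zfs_admin_snapshot module option are per the references cited):
      # On ZFS on Linux, re-enable snapshot administration via .zfs/snapshot:
      echo 'options zfs zfs_admin_snapshot=1' >> /etc/modprobe.d/zfs.conf
      # Delegate only the snapshot permission (not destroy) to the user:
      zfs allow -u amarini snapshot data01/shome/amarini
      # The user can then take a snapshot from an NFSv4 client with plain mkdir:
      mkdir /zfs/data01/shome/amarini/.zfs/snapshot/mysnap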
    • Work in progress...
  • FYI to CSCS: at PSI I have tuned the dCache Xrootd thread limit to xrootd.limits.threads=160; the default of 1000 was too high for us, since we were recurrently accumulating 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
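    For reference, a sketch of where that setting could live (the file path is the standard dCache location and is an assumption; the property name and values come from the note above):
      # /etc/dcache/dcache.conf
      # Cap the Xrootd door thread pool; the shipped default of 1000 was too high for us.
      xrootd.limits.threads=160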
  • Processed the EGI SVG Advisory ('Critical' risk): remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.

UNIBE-LHEP

  • Operations
    • Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report.
    • Re-deployment of the ce01 cluster under way:
      • SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to get a functional resource BDII on the ARC CE)
      • about 900 worker-cores installed
      • new Lustre (version 2.5.3, 200 disks); Thumpers decommissioned
      • moved to SLURM, cutting my teeth on it.
      • hope to go online in the next few hours
    • Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
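    • For reference, a sketch of the corresponding update command (the package globs and the slc6-testing repository name are as given in the item above):
      # Pull the patched NSS/NSPR packages from the SLC6 testing repository:
      yum --enablerepo=slc6-testing update 'nss*' 'nspr*'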
  • ATLAS specific operations
    • Implementing the requested monthly dumps of the namespace on the DPM SE.
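    • A hypothetical sketch of scheduling such a dump (the script name and output path are assumptions, not the actual implementation):
      # /etc/cron.d/dpm-namespace-dump  (hypothetical)
      # Dump the DPM namespace at 04:00 on the 1st of every month:
      0 4 1 * * root /usr/local/bin/dpm-namespace-dump > /var/spool/dpm-dumps/dump-$(date +\%Y\%m).txt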

UNIBE-ID

  • Commissioning
    • Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; to be delivered in 12/2015
    • Another 32 nodes will be ordered in early 2016
  • Operations
    • Prolonged maintenance downtime due to a painful migration to the new GPFS storage
      • Lesson learned (by us + an IBM techie!): using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
      • though with no data loss
    • Since then smooth operation again
    • Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow within the already scheduled maintenance downtime
  • ATLAS specific operations
    • no problems
    • ordered a new SSL certificate for nordugrid.unibe.ch due to the STRICT_RFC2818 switch by Globus GSI clients
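    • A quick check that the new certificate carries the hostname in its Subject Alternative Name, which is what strict RFC 2818 validation looks at (standard openssl usage; the certificate path is the usual grid location and is an assumption):
      # Print the SAN extension of the host certificate:
      openssl x509 -in /etc/grid-security/hostcert.pem -noout -text | grep -A1 'Subject Alternative Name'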

UNIGE

  • Xxx

NGI_CH

Other topics

  • Daniel is being replaced as the CMS contact person
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Pablo, Dario, Dino, Gianni
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco
  • LHCb: Roland Bernet
  • EGI: Gianfranco

Action items

  • Item1