Swiss Grid Operations Meeting on 2015-11-10

Site status

CSCS

Systems:
  • HP Smart Array issues (configuration loss and failure to boot); we lost a lot of time with HP support. We found the solution ourselves: disable the Smart Array and enable legacy mode for the boot disk.
  • Extended the IB bridges warranty until spring 2016
  • Requested new certificates for the argus* hosts with the correct DNS AltName
  • LHCb jobs are still not running well; we suggested to Vladimir that he use the right runtime environments (env/proxy and glite), but there has been no change so far.
  • CMS is testing multicore jobs
  • Working hard to finalize the arc02 Puppet configuration.
  • We are planning to decommission cream04
  • Planning the upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183); almost all services on the cluster are affected.
  • Getting offers for the Phoenix expansion
Storage:
  • Scratch - GPFS: NetApp storage firmware upgrade (no service interruption).
  • dCache:
    • We still have the cleaner problem, mainly with CMS. At the moment the cleaner has to be run manually, but the situation has stabilized after some big deletions by CMS.
    • This week we should finalise the configuration of a pre-production system where we will test the 2.6 -> 2.10 (2.13) upgrade, so that we can upgrade production by the end of this month.

PSI

  • FYI: 1 TB OwnCloud/EOS @ CERN: http://cernbox.web.cern.ch/
  • NFSv4
    • Context : MeetingSwissGridOperations20151015#PSI
    • In the end I made a RAID10 (12 ZFS mirror pairs) with 24 disks and no spare:
      Mon Nov 9 17:44:37 CET 2015
                                             capacity     operations    bandwidth
      pool                                alloc   free   read  write   read  write
      ----------------------------------  -----  -----  -----  -----  -----  -----
      data01                              4.51T  2.00T  7.43K  17.6K   590M  1.21G
        mirror                             385G   171G    734  1.43K  50.5M   100M
          pci-0000:08:00.0-scsi-0:2:0:0       -      -    331    978  23.8M   100M
          pci-0000:08:00.0-scsi-0:2:1:0       -      -    337    980  27.1M   101M
        mirror                             385G   171G    660  1.48K  49.9M  99.5M
          pci-0000:08:00.0-scsi-0:2:2:0       -      -    312    888  26.1M  99.8M
          pci-0000:08:00.0-scsi-0:2:3:0       -      -    293    910  24.0M   103M
        mirror                             385G   171G    587  1.52K  48.9M   106M
          pci-0000:08:00.0-scsi-0:2:4:0       -      -    251    940  22.8M   106M
          pci-0000:08:00.0-scsi-0:2:5:0       -      -    290    939  26.3M   106M
        mirror                             385G   171G    608  1.46K  49.2M   106M
          pci-0000:08:00.0-scsi-0:2:6:0       -      -    274    949  24.8M   108M
          pci-0000:08:00.0-scsi-0:2:7:0       -      -    284    933  24.7M   106M
        mirror                             385G   171G    583  1.35K  49.0M   103M
          pci-0000:08:00.0-scsi-0:2:8:0       -      -    264    917  24.4M   104M
          pci-0000:08:00.0-scsi-0:2:9:0       -      -    267    908  24.8M   103M
        mirror                             385G   171G    566  1.49K  46.5M   107M
          pci-0000:08:00.0-scsi-0:2:10:0      -      -    255    952  23.5M   108M
          pci-0000:08:00.0-scsi-0:2:11:0      -      -    260    944  23.2M   107M
        mirror                             385G   171G    607  1.56K  49.9M   106M
          pci-0000:08:00.0-scsi-0:2:12:0      -      -    277    942  24.1M   106M
          pci-0000:08:00.0-scsi-0:2:13:0      -      -    265    953  25.9M   108M
        mirror                             385G   171G    638  1.43K  49.0M   102M
          pci-0000:08:00.0-scsi-0:2:14:0      -      -    281    960  23.3M   102M
          pci-0000:08:00.0-scsi-0:2:15:0      -      -    288    978  26.0M   105M
        mirror                             385G   171G    622  1.48K  48.9M   105M
          pci-0000:08:00.0-scsi-0:2:16:0      -      -    277   1003  24.1M   109M
          pci-0000:08:00.0-scsi-0:2:17:0      -      -    296    975  25.1M   105M
        mirror                             385G   171G    672  1.48K  50.6M   102M
          pci-0000:08:00.0-scsi-0:2:18:0      -      -    296    926  24.2M   105M
          pci-0000:08:00.0-scsi-0:2:19:0      -      -    324    911  26.7M   103M
        mirror                             385G   171G    670  1.50K  49.7M   103M
          pci-0000:08:00.0-scsi-0:2:20:0      -      -    302    935  24.6M   106M
          pci-0000:08:00.0-scsi-0:2:21:0      -      -    309    915  25.4M   103M
        mirror                             385G   171G    658  1.39K  47.6M   105M
          pci-0000:08:00.0-scsi-0:2:22:0      -      -    305    980  24.0M   107M
          pci-0000:08:00.0-scsi-0:2:23:0      -      -    288    965  24.0M   105M
      ----------------------------------  -----  -----  -----  -----  -----  -----
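    • For illustration, a minimal sketch of how a 12-mirror-pair pool like this could be created (the device names sdb..sdy are placeholders, not the pci-* aliases actually used above):
      # Create a pool of 12 two-way mirrors (RAID10-style), no spare:
      zpool create data01 \
        mirror sdb sdc  mirror sdd sde  mirror sdf sdg \
        mirror sdh sdi  mirror sdj sdk  mirror sdl sdm \
        mirror sdn sdo  mirror sdp sdq  mirror sdr sds \
        mirror sdt sdu  mirror sdv sdw  mirror sdx sdy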
    • Instead of a single ZFS filesystem I made a hierarchy of filesystems, as advised by Oracle:
      # zfs list
      NAME                   USED  AVAIL  REFER  MOUNTPOINT
      data01                4.41T  1.90T    96K  /zfs/data01
      data01/shome          4.06T  1.90T   120K  /zfs/data01/shome
      data01/shome/amarini  42.0G  1.90T  42.0G  /zfs/data01/shome/amarini
      data01/shome/bbilin    120K  1.90T   120K  /zfs/data01/shome/bbilin
      data01/shome/bianchi  67.7G  1.90T  67.7G  /zfs/data01/shome/bianchi
      data01/shome/casal    77.1G  1.90T  77.1G  /zfs/data01/shome/casal
      ...
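    • A minimal sketch of how such a per-user hierarchy could be created (the usernames are taken from the listing above; the loop is illustrative):
      # One child filesystem per user under data01/shome:
      zfs create data01/shome
      for u in amarini bbilin bianchi casal; do
        zfs create data01/shome/$u
      done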
    • By setting properties on the root of the hierarchy, they get propagated to each child:
      # zfs get all data01/shome
      NAME          PROPERTY              VALUE                  SOURCE
      data01/shome  type                  filesystem             -
      data01/shome  creation              Sun Nov 8 18:06 2015   -
      data01/shome  used                  4.06T                  -
      data01/shome  available             1.90T                  -
      data01/shome  referenced            120K                   -
      data01/shome  compressratio         1.30x                  -
      data01/shome  mounted               yes                    -
      data01/shome  quota                 none                   default
      data01/shome  reservation           none                   default
      data01/shome  recordsize            128K                   default
      data01/shome  mountpoint            /zfs/data01/shome      inherited from data01
      data01/shome  sharenfs              rw=@192.33.123.0/24,sync,root_squash,no_all_squash,no_subtree_check,crossmnt,sec=sys  local
      data01/shome  checksum              on                     default
      data01/shome  compression           lz4                    inherited from data01
      data01/shome  atime                 on                     default
      data01/shome  devices               on                     default
      data01/shome  exec                  on                     default
      data01/shome  setuid                on                     default
      data01/shome  readonly              off                    default
      data01/shome  zoned                 off                    default
      data01/shome  snapdir               hidden                 inherited from data01
      data01/shome  aclinherit            restricted             default
      data01/shome  canmount              on                     default
      data01/shome  xattr                 on                     default
      data01/shome  copies                1                      default
      data01/shome  version               5                      -
      data01/shome  utf8only              off                    -
      data01/shome  normalization         none                   -
      data01/shome  casesensitivity       sensitive              -
      data01/shome  vscan                 off                    default
      data01/shome  nbmand                off                    default
      data01/shome  sharesmb              off                    default
      data01/shome  refquota              none                   default
      data01/shome  refreservation        none                   default
      data01/shome  primarycache          all                    default
      data01/shome  secondarycache        all                    default
      data01/shome  usedbysnapshots       0                      -
      data01/shome  usedbydataset         120K                   -
      data01/shome  usedbychildren        4.06T                  -
      data01/shome  usedbyrefreservation  0                      -
      data01/shome  logbias               latency                default
      data01/shome  dedup                 off                    default
      data01/shome  mlslabel              none                   default
      data01/shome  sync                  standard               inherited from data01
      data01/shome  refcompressratio      1.00x                  -
      data01/shome  written               120K                   -
      data01/shome  logicalused           5.24T                  -
      data01/shome  logicalreferenced     52K                    -
      data01/shome  filesystem_limit      none                   default
      data01/shome  snapshot_limit        none                   default
      data01/shome  filesystem_count      60                     local
      data01/shome  snapshot_count        0                      local
      data01/shome  snapdev               hidden                 default
      data01/shome  acltype               off                    default
      data01/shome  context               none                   default
      data01/shome  fscontext             none                   default
      data01/shome  defcontext            none                   default
      data01/shome  rootcontext           none                   default
      data01/shome  relatime              on                     temporary
      data01/shome  redundant_metadata    all                    default
      data01/shome  overlay               off                    default
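    • A sketch of setting those properties once at the top of the hierarchy (the values are copied from the listing above; children then report SOURCE "inherited from ..."):
      # Set once on the parent; children inherit automatically:
      zfs set compression=lz4 data01
      zfs set snapdir=hidden data01
      zfs set sharenfs='rw=@192.33.123.0/24,sync,root_squash,no_all_squash,no_subtree_check,crossmnt,sec=sys' data01/shome
      # Verify inheritance on the children:
      zfs get -r compression data01/shome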
    • Taking a recursive snapshot of the root of the hierarchy takes a snapshot of each child, all at the same time.
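      For example (the snapshot name is illustrative):
      # One atomic, recursive snapshot of data01/shome and all of its children:
      zfs snapshot -r data01/shome@nightly-20151109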
    • Taking snapshots (but without granting the destroy permission) can be delegated to each user on their own filesystem, and can even be driven by simple NFSv4 mkdir commands! Oracle Ref; it needs a tweak on ZFS on Linux:
      The use of mkdir/rmdir/mv in the .zfs/snapshot directory has been disabled by default both locally and via NFS clients. The zfs_admin_snapshot module option can be used to re-enable this functionality. 
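      A sketch of the delegation (the username is taken from the listing above as an example; zfs allow and the zfs_admin_snapshot module option are per the references cited):
      # On ZFS on Linux, re-enable snapshot administration via .zfs/snapshot:
      echo 'options zfs zfs_admin_snapshot=1' >> /etc/modprobe.d/zfs.conf
      # Delegate only the snapshot permission (not destroy) to the user:
      zfs allow -u amarini snapshot data01/shome/amarini
      # The user can then take a snapshot from an NFSv4 client with plain mkdir:
      mkdir /zfs/data01/shome/amarini/.zfs/snapshot/mysnap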
    • Work in progress...
  • FYI to CSCS: at PSI I have tuned the dCache Xrootd thread limit to xrootd.limits.threads=160; the default of 1000 was too high for us, since we were recurrently accumulating 1000 Xrootd sessions from the Internet that eventually expired with a timeout.
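    For reference, a sketch of where that setting could live (the file path is the standard dCache location and is an assumption; the property name and values come from the note above):
      # /etc/dcache/dcache.conf
      # Cap the Xrootd door thread pool; the shipped default of 1000 was too high for us.
      xrootd.limits.threads=160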
  • Processed the EGI SVG Advisory ('Critical' risk): remote arbitrary code execution vulnerabilities in the core crypto library used by RedHat.

UNIBE-LHEP

  • Operations
    • Smooth(-ish) operation on ce02, quite stable at just over 500 cores. Nothing of relevance to report.
    • Re-deployment of the ce01 cluster under way:
      • SLC 6.7 and ARC 5.0.3 (needed a downgrade of openldap* to get a functional resource BDII on the ARC CE)
      • about 900 worker-cores installed
      • new Lustre (version 2.5.3, 200 disks); Thumpers decommissioned
      • moved to SLURM, cutting my teeth on it.
      • hope to go online in the next few hours
    • Patching against CVE-2015-7183 (nss*, nspr* from slc6-testing)
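    • For reference, a sketch of the corresponding update command (the package globs and the slc6-testing repository name are as given in the item above):
      # Pull the patched NSS/NSPR packages from the SLC6 testing repository:
      yum --enablerepo=slc6-testing update 'nss*' 'nspr*'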
  • ATLAS specific operations
    • Implementing the requested monthly dumps of the namespace on the DPM SE.
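    • A hypothetical sketch of scheduling such a dump (the script name and output path are assumptions, not the actual implementation):
      # /etc/cron.d/dpm-namespace-dump  (hypothetical)
      # Dump the DPM namespace at 04:00 on the 1st of every month:
      0 4 1 * * root /usr/local/bin/dpm-namespace-dump > /var/spool/dpm-dumps/dump-$(date +\%Y\%m).txt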

UNIBE-ID

  • Commissioning
    • Ordered the first 32 new compute nodes (Broadwell) with a total of 640 cores; to be delivered in 12/2015
    • Another 32 nodes will be ordered in early 2016
  • Operations
    • Prolonged maintenance downtime due to a painful migration to the new GPFS storage
      • Lesson learned (by us + an IBM techie!): using AFM and additionally doing rsyncs is a huge no-go and leads to a corrupted filesystem when AFM is disabled at the end
      • though with no data loss
    • Since then smooth operation again
    • Upgrade of libnss (Advisory-SVG-2015-CVE-2015-7183) will be done tomorrow within the already scheduled maintenance downtime
  • ATLAS specific operations
    • no problems
    • ordered a new SSL certificate for nordugrid.unibe.ch due to the STRICT_RFC2818 switch by Globus GSI clients
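    • A quick check that the new certificate carries the hostname in its Subject Alternative Name, which is what strict RFC 2818 validation looks at (standard openssl usage; the certificate path is the usual grid location and is an assumption):
      # Print the SAN extension of the host certificate:
      openssl x509 -in /etc/grid-security/hostcert.pem -noout -text | grep -A1 'Subject Alternative Name'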

UNIGE

  • Xxx

NGI_CH

Other topics

  • Daniel is being replaced as the CMS contact person
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Pablo, Dario, Dino, Gianni
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco
  • LHCb: Roland Bernet
  • EGI: Gianfranco

Action items

  • Item1