Swiss Grid Operations Meeting on 2016-04-07 at 15:30
Site status
CSCS
- New ARC CE instance (arc03) installed along with a new SLURM instance (15.08.8) and all the recently purchased WNs
(this cluster is integrated into the CSCS LDAP and the central SLURM DB)
- certificate mess last week (Gianni's fault!); thanks to Gianfranco and Sigve for their help
- some time spent fixing the Information System (GGUS 118922)
- tentative planned maintenance on 2016-05-03 to replace IB/Eth bridges, move some VMs, and reinstall arc02
- CREAM CEs to be decommissioned by the beginning of June
- Accounting numbers (from scheduler) from last month
GPFS
- No issues to report
- Migration of metadata from local SSD to FC Flash should be performed on May 3rd
dCache
- Almost ready to deploy the first 500TB of new storage (from NETAPP 5560)
- The additional 500TB (from SFA12K) will be ready in the first part of May
- Investigating some "unexpected" file deletions (CMS)
PSI
- Put the new CentOS7/ZFS/NFSv4 /homes hierarchy into production
- Installing 9 new Dalco servers (2 disks were dead on arrival); each:
- Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 cores (HT on)
- 128GB RAM
- 4x 900GB 10k SAS disks in mdadm RAID 1+0, set up by Kickstart
- made a 100GB partition formatted as XFS in order to test FS-Cache + NFSv4 (see the sketch below)
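As a side note, a minimal sketch of how such an FS-Cache test could be sanity-checked is given below. It is illustrative only: the mount point, the use of the 'fsc' mount option and the reliance on /proc/fs/fscache/stats are assumptions for the example, not a description of the actual PSI setup.

```python
#!/usr/bin/env python3
"""Sanity-check that an NFSv4 mount is actually backed by FS-Cache.
Illustrative sketch only: the mount point below is hypothetical and the
stats file is only present when the kernel exposes FS-Cache statistics."""

import sys

MOUNT_POINT = "/mnt/nfs-cache-test"       # assumed test mount (mounted with -o fsc)
FSCACHE_STATS = "/proc/fs/fscache/stats"  # kernel FS-Cache counters, if enabled


def mount_uses_fsc(mount_point):
    """Return True if the mount point is listed in /proc/mounts with the 'fsc' option."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mount_point:
                return "fsc" in fields[3].split(",")
    return False


def dump_fscache_stats():
    """Print the raw FS-Cache counters (exact format varies between kernel versions)."""
    try:
        with open(FSCACHE_STATS) as stats:
            sys.stdout.write(stats.read())
    except IOError:
        print("No FS-Cache statistics found; cachefilesd may not be running")


if __name__ == "__main__":
    if not mount_uses_fsc(MOUNT_POINT):
        print("{0} is not mounted with the 'fsc' option".format(MOUNT_POINT))
        sys.exit(1)
    dump_fscache_stats()
```

Run it after mounting the export with the 'fsc' option (e.g. mount -t nfs4 -o fsc ...) and repeating a read; the cache counters should move if FS-Cache is actually in play.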
- Accounting numbers (from scheduler) from last month
UNIBE-LHEP
Operations
- mostly stable operation on both systems, except for:
- some random failures on some ce01 nodes (trans: Transformation not installed in CE)
- leads to flipping between blacklisting and whitelisting by HC
- usually a CVMFS-related problem, but CVMFS reports fine on all nodes
- under investigation right now
- eth0 dropped twice within 12h on the ce01 Lustre MDS:
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)...
Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
- leaves Lustre hanging; power-cycling is needed to recover (Lustre comes back quickly)
- maybe flaky h/w; getting a spare card to plug in in case of recurrence
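Until the spare card arrives, a throwaway watcher along these lines could flag a recurrence before Lustre clients start hanging. This is only a sketch under assumptions: it presumes the MDS logs kernel messages through journald (an SL6-era host would use /var/log/messages instead), and the alert action is left as a placeholder.

```python
#!/usr/bin/env python3
"""Watch the kernel log for a recurrence of the e1000e adapter reset seen
on mds-2-1. Sketch only: the log source and the alert mechanism are
assumptions, not part of the current UNIBE-LHEP setup."""

import subprocess
import sys

# Substrings of the messages logged on Mar 31 when eth0 dropped
PATTERNS = (
    "nobody cared",
    "Reset adapter unexpectedly",
)


def watch_kernel_log():
    # 'journalctl -k -f' follows kernel messages on a systemd host; on a
    # syslog-only host, 'tail -F /var/log/messages' would be the equivalent.
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "--no-pager"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    for line in proc.stdout:
        if any(pattern in line for pattern in PATTERNS):
            # Placeholder alert: hook up mail/Nagios/SNMP as appropriate.
            sys.stderr.write("NIC event on MDS: " + line)


if __name__ == "__main__":
    watch_kernel_log()
```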
ATLAS specific operations
- HC online 33% (last month, single-core only - not a huge impact since over 80% of the work is MCORE):
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
- 63% of ATLAS/CH wall time, 70% of CPU time in March:
http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all
- Still on ice: no progress on the storage dumps requested by ATLAS (blocked by the pending re-deployment of the DPM head node on SLC6)
- but I have asked for this to be re-discussed within ADC (in my view this should be implemented at the middleware level)
- UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably
- Accounting numbers (from scheduler) from last month (Mar 2016) - NOTE: ce03/CLOUD not reported yet
- WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
- Accounting numbers (from ATLAS dashboard) from last month (Mar 2016)
- CPU h: 672148 (933386.8 with cloud)
- WC h: 909450 (1243195.7 with cloud)
UNIBE-ID
- All servers (but one) moved from RHEL to CentOS, and all are now puppetized - finally
- Short storage outages in March
- In Feb: upgraded ESS-3.0 (GPFS-4.1.0) => ESS-3.5 (GPFS-4.1.1)
- => GPFS cluster overload at certain moments => stale file handles
- Turned off certain logging/tracing facilities in GPFS
- Now perfectly stable again for the last 3 weeks
- Ordered an additional 76 nodes on top of the 32 nodes ordered last December:
- Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
- 128GB RAM
- => homogeneous queue with 108 nodes (2160 cores) exclusively for MPI usage
- Accounting numbers (from scheduler) from last month (Mar 2016):
- CPU h: 195476
- WC h: 67481
UNIGE
- Production:
- Running smoothly in test mode for ATLAS (some checks still pending)
- High load on the cluster from local users (need to watch the batch system more closely, since there is a higher chance of nodes going down)
- Host certificates recently replaced for the DPM head and disk nodes + ARC-CE (running late because the e-mails were sent to Szymon)
- Storage:
- ATLASLOCALGROUPDISK space token was almost full; after some cleaning of old datasets it is now ~75% full (~106 TB free)
- Only one UniGe user has useful datasets at CSCS; these are being moved to UniGe. Afterwards, ATLASLOCALGROUPDISK will be merged with ATLASSCRATCHDISK
- Providing ATLAS storage dumps every month
- Outlook:
- 3 User Interfaces running SLC5 will be decommissioned; this may be a good opportunity to start moving to CentOS
- Accounting numbers (from scheduler) from last month (Files attached for Feb 2016 and Jan-Feb 2016)
NGI_CH
- Nothing of relevance
- NGI-CH Open Tickets review
- CSCS
- 120551: CSCS-LCG2_MCORE: 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
- 120505: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
- 120405: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
- 119171: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
- UNIBE-LHEP
- 120257: glidein validation errors for Microboone... (UBOONE) - Following up on OSG, this should be closed
- 117899: ATLAS request- storage consistency check... (ATLAS) - On hold
- NGI_CH
- 120184: NGI_CH - February 2016 - RP/RC OLA performance - Slow response to 2 tickets (average March response 8.51):
- "please remind to set the proper status when handling the tickets"
- replied to it now
Other topics
Next meeting date:
A.O.B.
Attendees
- CSCS: Pablo, Dario, Dino, Gianni
- CMS: Fabio
- ATLAS: Luis
- LHCb: Roland
- EGI:
Action items