Swiss Grid Operations Meeting on 2016-04-07 at 15:30
Site status
CSCS
- New ARC CE instance (arc03) installed along with a new SLURM instance (15.08.8) and all the recently purchased WNs
(this cluster is integrated into the CSCS LDAP and the central SLURM DB)
- certificate mess last week (Gianni's fault!); thanks to Gianfranco and Sigve for their help
- some time spent fixing the Information System (GGUS 118922)
- tentative planned maintenance on 2016-05-03 to replace IB/Eth bridges, move some VMs, and reinstall arc02
- CREAM CEs to be decommissioned by the beginning of June
- Accounting numbers (from scheduler) from last month
GPFS
- No issues to report
- Migration of metadata from local SSD to FC Flash should be performed on May 3rd
dCache
- Almost ready to deploy the first 500TB of new storage (from NETAPP 5560)
- The additional 500TB (from SFA12K) will be ready in the first part of May
- Investigating some "unexpected" file deletions (CMS)
PSI
- Put the new CentOS7/ZFS/NFSv4 /homes hierarchy into production
- Installing 9 new Dalco servers (2 disks were dead on arrival); each:
- Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz, 64 cores (HT on)
- 128GB RAM
- 4x 900GB 10k SAS disks in mdadm RAID 1+0, set up by Kickstart
- made a 100GB partition formatted as XFS in order to test FS-Cache + NFSv4 (see the sketch below)
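As a side note, a minimal sketch of how such an FS-Cache test could be sanity-checked is given below. It is illustrative only: the mount point, the use of the 'fsc' mount option and the reliance on /proc/fs/fscache/stats are assumptions for the example, not a description of the actual PSI setup.

```python
#!/usr/bin/env python3
"""Sanity-check that an NFSv4 mount is actually backed by FS-Cache.
Illustrative sketch only: the mount point below is hypothetical and the
stats file is only present when the kernel exposes FS-Cache statistics."""

import sys

MOUNT_POINT = "/mnt/nfs-cache-test"       # assumed test mount (mounted with -o fsc)
FSCACHE_STATS = "/proc/fs/fscache/stats"  # kernel FS-Cache counters, if enabled


def mount_uses_fsc(mount_point):
    """Return True if the mount point is listed in /proc/mounts with the 'fsc' option."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[1] == mount_point:
                return "fsc" in fields[3].split(",")
    return False


def dump_fscache_stats():
    """Print the raw FS-Cache counters (exact format varies between kernel versions)."""
    try:
        with open(FSCACHE_STATS) as stats:
            sys.stdout.write(stats.read())
    except IOError:
        print("No FS-Cache statistics found; cachefilesd may not be running")


if __name__ == "__main__":
    if not mount_uses_fsc(MOUNT_POINT):
        print("{0} is not mounted with the 'fsc' option".format(MOUNT_POINT))
        sys.exit(1)
    dump_fscache_stats()
```

Run it after mounting the export with the 'fsc' option (e.g. mount -t nfs4 -o fsc ...) and repeating a read; the cache counters should move if FS-Cache is actually in play.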
- Accounting numbers (from scheduler) from last month
UNIBE-LHEP
Operations
- mostly stable operation on both systems, except for:
- some random failures on some ce01 nodes (trans: Transformation not installed in CE)
- leads to flipping between blacklisting and whitelisting by HC
- usually a CVMFS-related problem, but CVMFS reports fine on all nodes
- under investigation right now
- eth0 dropped twice within 12h on the ce01 Lustre MDS:
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)...
Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
- leaves Lustre hanging; power-cycling is needed to recover (Lustre comes back quickly)
- maybe flaky h/w; getting a spare card to plug in in case of recurrence
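Until the spare card arrives, a throwaway watcher along these lines could flag a recurrence before Lustre clients start hanging. This is only a sketch under assumptions: it presumes the MDS logs kernel messages through journald (an SL6-era host would use /var/log/messages instead), and the alert action is left as a placeholder.

```python
#!/usr/bin/env python3
"""Watch the kernel log for a recurrence of the e1000e adapter reset seen
on mds-2-1. Sketch only: the log source and the alert mechanism are
assumptions, not part of the current UNIBE-LHEP setup."""

import subprocess
import sys

# Substrings of the messages logged on Mar 31 when eth0 dropped
PATTERNS = (
    "nobody cared",
    "Reset adapter unexpectedly",
)


def watch_kernel_log():
    # 'journalctl -k -f' follows kernel messages on a systemd host; on a
    # syslog-only host, 'tail -F /var/log/messages' would be the equivalent.
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "--no-pager"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    for line in proc.stdout:
        if any(pattern in line for pattern in PATTERNS):
            # Placeholder alert: hook up mail/Nagios/SNMP as appropriate.
            sys.stderr.write("NIC event on MDS: " + line)


if __name__ == "__main__":
    watch_kernel_log()
```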
ATLAS specific operations
- HC online 33% (last month, single-core only - not a huge impact since over 80% of the work is MCORE):
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
- 63% of ATLAS/CH wall time, 70% of CPU time in March:
http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all
- Still on ice: no progress on the storage dumps requested by ATLAS (blocked by the pending re-deployment of the DPM head node on SLC6)
- but I have asked for this to be re-discussed within ADC (in my view this should be implemented at the middleware level)
- UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably
- Accounting numbers (from scheduler) from last month (Mar 2016) - NOTE: ce03/CLOUD not reported yet
- WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
- Accounting numbers (from ATLAS dashboard) from last month (Mar 2016)
- CPU h: 672148 (933386.8 with cloud)
- WC h: 909450 (1243195.7 with cloud)
UNIBE-ID
- All servers (but one) moved from RHEL to CentOS, and all are now puppetized - finally
- Short storage outages in March
- In Feb: upgraded ESS-3.0 (GPFS-4.1.0) => ESS-3.5 (GPFS-4.1.1)
- => GPFS cluster overload at certain moments => stale file handles
- Turned off certain logging/tracing facilities in GPFS
- Now perfectly stable again for the last 3 weeks
- Ordered an additional 76 nodes on top of the 32 nodes ordered last December:
- Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
- 128GB RAM
- => homogeneous queue with 108 nodes (2160 cores) exclusively for MPI usage
- Accounting numbers (from scheduler) from last month (Mar 2016):
- CPU h: 195476
- WC h: 67481
UNIGE
- Production:
- Running smoothly in test mode for ATLAS (some checks still pending)
- High load on the cluster from local users (need to watch the batch system more closely, since there is a higher chance of nodes going down)
- Host certificates recently replaced for the DPM head and disk nodes + ARC-CE (running late because the e-mails were sent to Szymon)
- Storage:
- ATLASLOCALGROUPDISK space token was almost full; after some cleaning of old datasets it is now ~75% full (~106 TB free)
- Only one UniGe user has useful datasets at CSCS; these are being moved to UniGe. Afterwards, ATLASLOCALGROUPDISK will be merged with ATLASSCRATCHDISK
- Providing ATLAS storage dumps every month
- Outlook:
- 3 User Interfaces running SLC5 will be decommissioned; this may be a good opportunity to start moving to CentOS
- Accounting numbers (from scheduler) from last month (Files attached for Feb 2016 and Jan-Feb 2016)
NGI_CH
- Nothing of relevance
- NGI-CH Open Tickets review
- CSCS
- 120551: CSCS-LCG2_MCORE: 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
- 120505: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
- 120405: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
- 119171: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
- UNIBE-LHEP
- 120257: glidein validation errors for Microboone... (UBOONE) - Following up on OSG, this should be closed
- 117899: ATLAS request- storage consistency check... (ATLAS) - On hold
- NGI_CH
- 120184: NGI_CH - February 2016 - RP/RC OLA performance - Slow response to 2 tickets (average March response 8.51):
- "please remind to set the proper status when handling the tickets"
- replied to it now
Other topics
Next meeting date:
A.O.B.
Attendees
- CSCS: Pablo, Dario, Dino, Gianni
- CMS: Fabio
- ATLAS: Luis
- LHCb: Roland
- EGI:
Action items