
Swiss Grid Operations Meeting on 2016-04-07 at 15:30

Site status

CSCS

  • New ARC CE instance (arc03) installed along with a new SLURM instance (15.08.8) and all the recently purchased WNs
    (this cluster is integrated into CSCS LDAP and central SLURM DB)
  • certificate mess last week (Gianni's fault!): thanks to Gianfranco and Sigve for their help
  • some time spent fixing the Information System (GGUS 118922)
  • tentative planned maintenance on 2016-05-03 to replace IB/Eth bridges, move some VMs and reinstall arc02
  • CREAM CEs to be decommissioned by the beginning of June
  • Accounting numbers (from scheduler) from last month
GPFS
  • No issues to report
  • Metadata migration from local SSD to FC Flash should be performed on May 3rd
dCache
  • Almost ready to deploy the first 500TB of new storage (from NETAPP 5560)
  • The additional 500TB will be ready by the first part of May (from SFA12K)
  • Investigating some "unexpected" file deletions (CMS)

PSI

UNIBE-LHEP

Operations

  • mostly stable operation on both systems, except for:
  • some random failures on some ce01 nodes (trans: Transformation not installed in CE)
    • leads to flipping between black and white-listing by HC
    • usually a cvmfs related problem, but cvmfs reports fine on all nodes
    • under investigation right now
  • eth0 dropped twice within 12h on the ce01 lustre mds:
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)
...
Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly


  • leaves Lustre hanging; needs power-cycling to recover (Lustre comes back quickly)
  • maybe flaky h/w; getting a spare card to plug in in case of recurrence
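To check whether the eth0 drop recurs, the MDS syslog can be scanned for the same adapter-reset message. The following is a minimal sketch only: the regular expression and the function name `find_resets` are assumptions made for illustration, and the sample input is just the two kernel lines quoted above.

```python
import re

# The two kernel messages quoted above, used here as sample input.
SAMPLE_LOG = """\
Mar 31 08:26:14 mds-2-1 kernel: irq 75: nobody cared (try booting with the "irqpoll" option)
Mar 31 08:26:31 mds-2-1 kernel: e1000e 0000:03:00.0: eth0: Reset adapter unexpectedly
"""

# Assumed pattern: classic syslog timestamp, hostname, then the e1000e reset message.
RESET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"e1000e\s+\S+\s+(?P<iface>\w+):\s+Reset adapter unexpectedly"
)


def find_resets(text):
    """Return (timestamp, host, interface) for every adapter-reset line in text."""
    events = []
    for line in text.splitlines():
        match = RESET_RE.match(line)
        if match:
            events.append(
                (match.group("stamp"), match.group("host"), match.group("iface"))
            )
    return events


resets = find_resets(SAMPLE_LOG)
for stamp, host, iface in resets:
    print(f"{stamp} {host}: {iface} reset")
```

In practice the text passed to `find_resets` would come from the node's own log file; its location depends on the local syslog setup and is not stated here.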
ATLAS specific operations
  • HC online 33% (last month, single core only - not huge impact since over 80% of work is MCORE):
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE


  • 63% of ATLAS/CH wall time (WT), 70% of CPU time in March:
http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites[]=CSCS-LCG2&sites[]=UNIBE-LHEP&sitesCat[]=CH-CHIPP-CSCS&resourcetype=All&sitesSort=2&sitesCatSort=2&start=2016-03-01&end=2016-03-31&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All&activities[]=all

  • Still on ice: No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
    • but I have asked to re-discuss this within ADC (in my view this should be implemented at the middleware level)
  • UNIBE-LHEP_CLOUD and UNIBE-LHEP_CLOUD_MCORE operating stably

  • Accounting numbers (from scheduler) from last month (Mar 2016) - NOTE: ce03/CLOUD not reported yet
    • WC h: 936908 (ATLAS) - 149450 (t2k.org) - 13838 (uboone) - 13 (ops)
  • Accounting numbers (from ATLAS dashboard) from last month (Mar 2016)
    • CPU h: 672148 (933386.8 with cloud)
    • WC h: 909450 (1243195.7 with cloud)
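The dashboard figures above can be turned into a CPU efficiency. This is a small sketch that assumes efficiency is defined as CPU hours divided by wall-clock hours; the dictionary keys and the function name are illustrative, and the numbers are copied from the March 2016 dashboard lines.

```python
# CPU and wall-clock hours from the ATLAS dashboard (Mar 2016), as listed above.
cpu_hours = {"grid": 672148.0, "grid+cloud": 933386.8}
wall_hours = {"grid": 909450.0, "grid+cloud": 1243195.7}


def cpu_efficiency(cpu_h, wc_h):
    """Return CPU time as a fraction of wall-clock time (assumed definition)."""
    if wc_h <= 0:
        raise ValueError("wall-clock hours must be positive")
    return cpu_h / wc_h


efficiency = {
    label: cpu_efficiency(cpu_hours[label], wall_hours[label]) for label in cpu_hours
}
for label, value in efficiency.items():
    print(f"{label}: {value:.1%}")
```

Both values come out at roughly three quarters, with the cloud resources included slightly above the grid-only figure.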

UNIBE-ID

  • All servers (but one) moved from RHEL to CentOS and all puppetized - finally
  • Short storage outages in March
    • in Feb: upgrade from ESS-3.0 (GPFS-4.1.0) to ESS-3.5 (GPFS-4.1.1)
    • => GPFS cluster overloaded at certain moments => Stale File Handles
    • Turned off certain logging/tracing facilities in GPFS
    • perfectly stable again for the last 3 weeks
  • Ordered 76 additional nodes on top of the 32 nodes we ordered last December:
    • Intel Xeon E5-2630v4 @ 2.2GHz, 20 cores (HT off)
    • 128GB RAM
    • => homogeneous queue with 108 nodes (2160 cores) exclusively for MPI usage
  • Accounting numbers (from scheduler) from last month (Mar 2016):
    • CPU h: 195476
    • WC h: 67481
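The totals quoted for the new MPI queue follow directly from the two orders. As a quick cross-check, assuming every node has the 20 cores listed (variable names are illustrative):

```python
# Node counts and per-node core count as stated in the UNIBE-ID report.
new_nodes = 76        # additional order
december_nodes = 32   # ordered last December
cores_per_node = 20   # HT off

total_nodes = new_nodes + december_nodes
total_cores = total_nodes * cores_per_node
print(f"{total_nodes} nodes, {total_cores} cores")
```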

UNIGE

  • Production:
    • Running smoothly in test mode for ATLAS (some checks still pending)
    • High cluster load from local users (need to watch the batch system more closely, since nodes are more likely to go down)
    • Host certificates recently replaced for DPM Head and Disk nodes + ARC-CE (done late because the e-mails were sent to Szymon)
  • Storage:
    • ATLASLOCALGROUPDISK space token was almost full; now (after some cleaning of old datasets) it is ~75% full (~106 TB free)
    • Only one UniGe user has useful datasets at CSCS; these are being moved to UniGe. Then, merge ATLASLOCALGROUPDISK with ATLASSCRATCHDISK
    • Providing ATLAS storage dumps every month
  • Outlook:
    • 3 User Interfaces running SLC5 will be decommissioned; this may be a good chance to start moving to CentOS
  • Accounting numbers (from scheduler) from last month (Files attached for Feb 2016 and Jan-Feb 2016)

NGI_CH

  • Nothing of relevance
  • NGI-CH Open Tickets review
    • CSCS-LCG2
      • 120551: CSCS-LCG2_MCORE : 75%+ jobs failed with ... (ATLAS team) - Not fully fixed yet (blacklisted right now, some HC jobs do not run)
      • 120505: Large amount of GLEXEC ERRORS on T2_CH_C.. (CMS) - Not touched for a week, changed to "waiting for reply"
      • 120405: Problem with accessing files at CSCS-LCG... (LHCb team) - In progress
      • 119171: Workflow failures at T2_CH_CSCS (CMS) - Changed to "waiting for reply"
    • UNIBE-LHEP
      • 120257: glidein validation errors for Microboone... (UBOONE) - Following up on OSG, this should be closed
      • 117899: ATLAS request- storage consistency check... (ATLAS) - On hold
    • NGI_CH

Other topics

Next meeting date:

A.O.B.

Attendants

  • CSCS: Pablo, Dario, Dino, Gianni
  • CMS: Fabio
  • ATLAS: Luis
  • LHCb: Roland
  • EGI:

Action items


Topic attachments
  • g07.2016.log (r1, 1.2 K, 2016-04-07 13:52, LuisMarch)
  • g07.201602.log (r1, 1.0 K, 2016-04-07 13:50, LuisMarch): Feb 2016
Topic revision: r21 - 2016-04-07 - MichaelRolli