
Swiss Grid Operations Meeting on 2016-06-02 at 14:00

Site status

CSCS

  • CREAM CE decommissioning is proceeding: currently checking APEL accounting before removing the CEs from GOCDB, to avoid any risk of losing official accounting data
  • Nagios re-installation ongoing
  • Working to bring back accounting data after the migration to the new cluster: it should become possible to run queries in a more flexible way (details upcoming)
  • Downtime scheduled to replace the CPUs with the v4 version on the newest 40 WNs (to be done by Dalco)
dCache
  • some tunings and puppet integration on the new storage (SE 23-26)
  • planning puppet integration on the rest of the storage infrastructure
  • IBM DC3500 decommissioned
GPFS
  • will apply the security patch (CVE-2016-0392) asap (v 3.5.0.31)
  • soon: move metadata to SAN Flash
  • next: move to Spectrum Scale 4.2.x and evaluate the possibility to enable the Highly-available write cache (HAWC) on the new (40) nodes

PSI

Accounting numbers (from scheduler) from last month
dCache 2.15 SQL

dCache 2.15 Derek's utilities

dCache 2.15 new Storage

Debugging the CMS Job Logs

Listing the recent 24h CMS Jobs at CSCS by CLI
  • you can grep the output for whatever you need, except the Job Log URL: they don't publish it, so you still need the CMS Dashboard
  • for CE in arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch ; do echo NEXT-CE=$CE ; curl --stderr - "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=$CE&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint" ; done
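The one-liner above can be hard to read and edit, so here is a minimal sketch of the same loop with the URL construction moved into a function. The query string is copied unchanged from the command above; the function name job_url, the CES variable and the PATTERN placeholder are illustrative assumptions, not part of the original. As written, the loop only prints the URLs; the commented curl line shows how the fetch could be added.

```shell
#!/bin/sh
# Build the CMS Dashboard job-status URL for one CE (hypothetical helper).
# The query string is the one from the one-liner, split over several lines
# for readability; only the ce= parameter varies.
job_url() {
  printf '%s' "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2"
  printf '%s' "?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby="
  printf '%s' "&ce=$1&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity="
  printf '%s' "&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint"
  printf '\n'
}

# CE list taken from the one-liner.
CES="arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch"

for ce in $CES; do
  echo "NEXT-CE=$ce"
  job_url "$ce"
  # To fetch and filter instead of printing (PATTERN is a placeholder):
  #   curl --silent "$(job_url "$ce")" | grep "$PATTERN"
done
```

Keeping the URL in one function means a changed parameter (for example len=5000) only has to be edited in one place.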
Fabio's Leaves
  • { [20-24] Jun , [11-15] Jul , [25-29] Jul , [8-12] Aug , [22-26] Aug }
  • Expect long delays in my email replies during these periods

UNIBE-LHEP

Operations

  • stable, no incidents to report
ATLAS specific operations
  • 40% of ATLAS/CH WT, but 67% of CPU time in May (all jobs) - CSCS shows >60% FAILED WT [1] (mostly "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
  • DPM head node migration to SLC6 and ATLAS storage dumps still on hold
HammerCloud report [2]
  • UNIBE-LHEP online >92% (last month), better than the previous month. There is still room for improvement, but the impact is limited since the interruptions are not long enough to cause the site to drain.
  • UNIBE-ID >99%
  • UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)
[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

  • Accounting numbers (from scheduler) from last month (May 2016) (includes ce03/CLOUD)
    • WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
  • Accounting numbers (from ATLAS dashboard) from last month (May 2016)
    • CPU h: 1194137
    • WC h: 1358408
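
As a quick sanity check on the ATLAS dashboard numbers above, the ratio of CPU hours to wall-clock hours gives a rough CPU efficiency. This is a minimal sketch, assuming both figures cover the same set of jobs and are normalised the same way (for example per core), which the minutes do not state; cpu_eff is a hypothetical helper name.

```shell
#!/bin/sh
# CPU efficiency = CPU hours / wall-clock hours, expressed as a percentage.
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
cpu_eff() {
  LC_ALL=C awk -v cpu="$1" -v wc="$2" 'BEGIN { printf "%.1f\n", 100 * cpu / wc }'
}

# ATLAS dashboard figures for May 2016 from the list above.
cpu_eff 1194137 1358408   # prints 87.9
```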

UNIBE-ID

  • Smooth operation in general; no outages
  • A mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate, but more wasted resources.
    • Medium-term goal: move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
  • As previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
  • Accounting numbers (from scheduler) from last month for ATLAS:
    • CPU h: 135'276
    • WC h: 108'001

UNIGE

  • Xxx
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. This is a long-running discussion, with a WLCG Task Force assigned to it. Stay tuned, but don't hold your breath :-)
  • Heads-up: the current funding for the minimal NGI_CH operation layer (10% FTE) ends at the end of the year, and a solution will need to be identified. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30 kCHF including ~10% FTE for operation). Certificates are now used beyond CHIPP as well.

  • NGI-CH Open Tickets review
  1. 120405 for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
  2. 117899 for UNIBE-LHEP (ATLAS) On hold (ATLAS request- storage dumps)

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Dario, Dino, Gianni
  • CMS: Fabio, Joosep ?
  • ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
  • LHCb:
  • EGI: apologies: Gianfranco (at NorduGrid 2016 conference)

Action items

  • Item1
Topic revision: r7 - 2016-06-02 - GianniRicciardi