Swiss Grid Operations Meeting on 2016-06-02 at 14:00

Site status

CSCS

  • Xxx
  • Accounting numbers (from scheduler) from last month

PSI

UNIBE-LHEP

Operations

  • stable, no incidents to report
ATLAS specific operations
  • 40% of ATLAS/CH WT, but 67% of CPU time in May (all jobs). CSCS shows >60% FAILED WT [1]; most failures are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE". Will open an RT ticket to follow up on this.
  • DPM head node migration to SLC6 and ATLAS storage dumps still on hold
HammerCloud report [2]
  • UNIBE-LHEP online >92% (last month), better than the previous month. Still room for improvement, but the impact is limited since interruptions are not long enough to cause the site to drain.
  • UNIBE-ID >99%
  • UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)
[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

  • Accounting numbers (from scheduler) from last month (May 2016) ( includes ce03/CLOUD )
    • WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
  • Accounting numbers (from ATLAS dashboard) from last month (May 2016)
    • CPU h: 1194137
    • WC h: 1358408
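
The dashboard figures above can be turned into a CPU efficiency, and the scheduler figures into per-VO wall-clock shares. The following is only a minimal sketch: the definition of efficiency as CPU hours divided by wall-clock hours and the helper names `cpu_efficiency` and `wc_shares` are assumptions for illustration, not something stated in these minutes.

```python
# Figures copied from the UNIBE-LHEP accounting bullets for May 2016.
# Scheduler wall-clock hours per VO (includes ce03/CLOUD).
wc_hours_by_vo = {"ATLAS": 1211030, "t2k.org": 23599, "uboone": 282, "ops": 7}

# ATLAS dashboard numbers.
dashboard_cpu_h = 1194137
dashboard_wc_h = 1358408


def cpu_efficiency(cpu_h, wc_h):
    """Assumed definition: CPU hours divided by wall-clock hours."""
    return cpu_h / wc_h


def wc_shares(hours_by_vo):
    """Fraction of the total wall-clock hours used by each VO."""
    total = sum(hours_by_vo.values())
    return {vo: hours / total for vo, hours in hours_by_vo.items()}


if __name__ == "__main__":
    print(f"ATLAS CPU efficiency (dashboard): {cpu_efficiency(dashboard_cpu_h, dashboard_wc_h):.1%}")
    for vo, share in wc_shares(wc_hours_by_vo).items():
        print(f"{vo}: {share:.2%} of scheduler WC h")
```

Note that the scheduler and dashboard wall-clock totals differ, so ratios should only be formed from numbers that come from the same source.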

UNIBE-ID

  • Smooth operation in general; no outages
  • A mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate but more wasted resources.
    • Medium-term goal: move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
  • As previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
  • Accounting numbers (from scheduler) from last month for ATLAS:
    • CPU h: 135'276
    • WC h: 108'001

UNIGE

  • Xxx
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. Long ongoing discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
  • Heads up: current funding for the minimal NGI_CH operation layer (10% FTE) will end by the end of the year. A solution will need to be identified. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30 kCHF including ~10% FTE for operation). Certificates are no longer used strictly by CHIPP only.

  • NGI-CH Open Tickets review
  1. 120405 for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
  2. 117899 for UNIBE-LHEP (ATLAS) On hold (ATLAS request: storage dumps)

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
  • LHCb:
  • EGI: apologies: Gianfranco (at NorduGrid 2016 conference)

Action items

  • Item1
Topic revision: r4 - 2016-06-02 - MichaelRolli
 