Swiss Grid Operations Meeting on 2016-06-02 at 14:00
Site status
CSCS
- CREAM CE decommissioning in progress: currently checking APEL accounting before removing the CEs from GOCDB, to avoid any risk of losing official accounting data
- Nagios re-installation ongoing
- Working to bring back accounting data after migration to the new cluster: it should be possible to perform queries in a more flexible way (details upcoming)
- Downtime scheduled to replace the CPUs with the v4 version on the latest 40 WNs (to be done by Dalco)
dCache
- some tuning and Puppet integration on the new storage (SE 23-26)
- planning Puppet integration on the rest of the storage infrastructure
- IBM DC3500 decommissioned
GPFS
- will apply the security patch (CVE-2016-0392) asap (v 3.5.0.31)
- soon: move metadata to SAN Flash
- next: move to Spectrum Scale 4.2.x and evaluate enabling the highly available write cache (HAWC) on the new (40) nodes
PSI
Accounting numbers (from scheduler) from last month
dCache 2.15 SQL
- I've found the time to update my SQL code for Chimera to match dCache 2.15
- once you have installed the code you get this /pnfs report out of the box, listing the /pnfs dirs ordered by size and refreshed every night:
- and you can invite users to delete their unnecessarily large dirs with, for instance:
- uberftp YOUR_SE 'rm -r /pnfs/a/b/c/target_dir'
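The idea behind the nightly report can be sketched in a few lines. This is only an illustration, not Fabio's SQL code: it assumes the file paths and sizes have already been exported from Chimera as (path, size) pairs, and the function name `dir_sizes` and the `depth` parameter are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def dir_sizes(files, depth=3):
    """Sum file sizes per directory, truncated to `depth` levels below '/'.

    `files` is an iterable of (absolute_path, size_in_bytes) pairs
    (assumed to come from a namespace dump; hypothetical input format).
    Returns a list of (directory, total_bytes), largest first.
    """
    totals = defaultdict(int)
    for path, size in files:
        # parts of the parent dir, e.g. ('/', 'pnfs', 'a', 'b')
        parts = PurePosixPath(path).parent.parts
        key = str(PurePosixPath(*parts[: depth + 1]))
        totals[key] += size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data, for illustration only
sample = [
    ("/pnfs/a/b/c/f1", 100),
    ("/pnfs/a/b/c/f2", 50),
    ("/pnfs/a/d/f3", 500),
    ("/pnfs/a/b/e/f4", 10),
]
report = dir_sizes(sample, depth=3)
for directory, total in report:
    print(f"{total:>12d}  {directory}")
```

Directories at the top of such a list are the natural candidates for the uberftp cleanup shown above.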
dCache 2.15 Derek's utilities
dCache 2.15 new Storage
Debugging the CMS Job Logs
Listing the recent 24h CMS Jobs at CSCS by CLI
Fabio's Leaves
- { [20-24] Jun , [11-15] Jul , [25-29] Jul , [8-12] Aug , [22-26] Aug }
- I'll reply to your emails with long delays
UNIBE-LHEP
Operations
- stable, no incidents to report
ATLAS specific operations
- 40% of ATLAS/CH WT, but 67% of CPU time in May (all jobs) - CSCS shows >60% FAILED WT [1] (most failures are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
- DPM head node migration to SLC6 and ATLAS storage dumps still on hold
HammerCloud report [2]
- UNIBE-LHEP online >92% (last month), better than the previous month. Still room for improvement, but the impact is limited since interruptions are not long enough to cause the site to drain.
- UNIBE-ID >99%
- UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)
[1]
http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb
[2]
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
- Accounting numbers (from scheduler) from last month (May 2016) (includes ce03/CLOUD)
- WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
- Accounting numbers (from ATLAS dashboard) from last month (May 2016)
- CPU h: 1194137
- WC h: 1358408
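As a quick cross-check of the UNIBE-LHEP numbers above, the CPU efficiency (CPU h divided by WC h, from the ATLAS dashboard figures) and the per-VO wall-clock shares (from the scheduler figures) can be derived as follows. This is a minimal sketch; the function names are illustrative and the input values are copied from the list above.

```python
def cpu_efficiency(cpu_hours, wallclock_hours):
    """Fraction of wall-clock time actually spent on CPU."""
    return cpu_hours / wallclock_hours


def shares(usage):
    """Map each VO to its fraction of the total wall-clock hours."""
    total = sum(usage.values())
    return {vo: hours / total for vo, hours in usage.items()}


# Scheduler wall-clock hours per VO, May 2016 (from the minutes)
scheduler_wc = {"ATLAS": 1211030, "t2k.org": 23599, "uboone": 282, "ops": 7}

# ATLAS dashboard numbers, May 2016 (from the minutes)
eff = cpu_efficiency(1194137, 1358408)
vo_shares = shares(scheduler_wc)

print(f"ATLAS dashboard CPU efficiency: {eff:.1%}")
for vo, share in vo_shares.items():
    print(f"{vo:>8}: {share:.2%} of scheduler WC h")
```

The dashboard figures give a CPU efficiency of roughly 88%, and ATLAS accounts for about 98% of the wall-clock hours seen by the scheduler.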
UNIBE-ID
- Smooth operation in general; no outages
- Mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate but more wasted resources.
- Medium-term goal: Move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
- As previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
- Accounting numbers (from scheduler) from last month for ATLAS:
- CPU h: 135'276
- WC h: 108'001
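The trade-off behind the h_vmem mitigation mentioned above can be illustrated with a toy model. It assumes that the multiplier simply scales the job's requested memory into the h_vmem hard limit and that the limit is also what gets reserved on the node; how submit-job-sge actually applies the multiplier is not described in these minutes, and all names and numbers below are hypothetical.

```python
def apply_multiplier(requested_mb, peak_mb, multiplier):
    """Toy model of an h_vmem hard limit scaled by a multiplier.

    Returns (limit_mb, killed, unused_mb):
    - limit_mb:  requested memory times the multiplier
    - killed:    True if the job's peak usage exceeds the limit
    - unused_mb: reserved memory the job never touched (0 if killed)
    """
    limit_mb = requested_mb * multiplier
    killed = peak_mb > limit_mb
    unused_mb = 0 if killed else limit_mb - peak_mb
    return limit_mb, killed, unused_mb


# Hypothetical job: requests 2000 MB, peaks at 2600 MB
before = apply_multiplier(2000, 2600, 1.0)  # limit too tight, job is killed
after = apply_multiplier(2000, 2600, 1.5)   # job survives, 400 MB unused

# Hypothetical well-behaved job: peaks at 1500 MB
modest_before = apply_multiplier(2000, 1500, 1.0)  # 500 MB unused
modest_after = apply_multiplier(2000, 1500, 1.5)   # 1500 MB unused
```

In this model a larger multiplier rescues jobs that overshoot their request, while every well-behaved job reserves more memory than it needs, which matches the "fewer failures, more waste" observation.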
UNIGE
- Xxx
- Accounting numbers (from scheduler) from last month
NGI_CH
- WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. Long-running discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
- Heads up: current funding for the minimal NGI_CH operation layer (10% FTE) will end at the end of the year, so a solution needs to be identified. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30 kCHF including ~10% FTE for operation). Certificates are no longer used strictly by CHIPP only.
- NGI-CH Open Tickets review
- 120405 for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
- 117899 for UNIBE-LHEP (ATLAS) On hold (ATLAS request: storage dumps)
Other topics
Next meeting date:
A.O.B.
Attendees
- CSCS: Dario, Dino, Gianni
- CMS: Fabio, Joosep ?
- ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
- LHCb:
- EGI: apologies: Gianfranco (at NorduGrid 2016 conference)
Action items
Topic revision: r7 - 2016-06-02 - GianniRicciardi