Swiss Grid Operations Meeting on 2016-06-02 at 14:00

Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
External link: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
Phone gate: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
Switch Vidyo SIP IP: 137.138.248.204

Swiss Grid Operations Meeting on 2016-06-02 at 14:00
- Site status
  - CSCS
  - PSI
  - UNIBE-LHEP
  - UNIBE-ID
  - UNIGE
  - NGI_CH
- Other topics
- A.O.B.
- Attendants
- Action items

Site status

CSCS

CREAM CE decommissioning in progress: currently checking APEL accounting before removing the CEs from GOCDB, to avoid any risk of losing official accounting data
Nagios re-installation ongoing
Working to bring back accounting data after migration to the new cluster: it should be possible to perform queries in a more flexible way (details upcoming)
Downtime scheduled to replace the CPUs with the v4 version on the latest 40 WNs (to be done by Dalco)

dCache

some tuning and Puppet integration on the new storage (SE 23-26)
planning Puppet integration on the rest of the storage infrastructure
IBM DC3500 decommissioned

GPFS

will apply the security patch (CVE-2016-0392) asap (v 3.5.0.31)
soon: move metadata to SAN Flash
next: move to Spectrum Scale 4.2.x and evaluate the possibility to enable the Highly-available write cache (HAWC) on the new (40) nodes

PSI

Accounting numbers (from scheduler) from last month
dCache 2.15 SQL

I found the time to update my SQL code for Chimera to match dCache 2.15
- https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home
- https://bitbucket.org/fabio79ch/v_pnfs/branch/master ( Chimera as in dCache 2.2 - 2.13 )
- https://bitbucket.org/fabio79ch/v_pnfs/branch/2.15
Once you have installed the code you get this /pnfs report out of the box: the /pnfs dirs ordered by size, to be refreshed every night :
- curl http://t3mon.psi.ch/ganglia/PSIT3-custom/v_pnfs_top_dirs.txt 2>/dev/null
You can then invite users to delete their unnecessarily big dirs with, for instance :
- uberftp YOUR_SE 'rm -r /pnfs/a/b/c/target_dir'

dCache 2.15 Derek's utilities

Need to update Derek's https://github.com/dfeich/dcache-shellutils utilities for dCache 2.15

dCache 2.15 new Storage

During 2016 we have to replace ~200TB net ; I see 3 options :
1. 4U-60disks http://www.netapp.com/us/products/storage-systems/e2700/index.aspx ( cheap / slow / big enough ) ; this is probably enough
2. 4U-60disks http://www.netapp.com/us/products/storage-systems/e5600/index.aspx ( expensive / fast / big enough )
3. 4U-90disks http://www.supermicro.com/products/chassis/4U/847/SC847DE2C-R2K04JBOD.cfm ( cheap / fast / bigger ) ; it needs ZFS on Linux

Debugging the CMS Job Logs

Found a way to allow Miguel and the other CSCS colleagues to browse the CMS Job Logs even if their X509 is unauthorized
In general, the recent arcbrisi jobs are listed at http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=JobDetailedView&p=1&records=200&activemenu=0&usr=&site=T2_CH_CSCS&ce=arcbrisi.cscs.ch
Each job features a 'Job Detail View' field, and each of those features a JobLog field. These job logs are hosted either :
- on a server like http://submit-5.t2.ucsd.edu/.. ( plain HTTP, so no issues )
- at CERN on a server like https://cmsweb.cern.ch/scheddmon/096/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt ( HTTPS asking for your X509, so you can't access them )
For the latter case, open a SOCKS tunnel through lxplus in a first terminal :
ssh -D 12345 YOURACCOUNT@lxplus.cern.ch
Then, in a second terminal, rewrite the https URL to the plain-HTTP vocms host and fetch it through the tunnel :

curl --socks5 localhost:12345 http://vocms096.cern.ch/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt
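
The URL rewrite above can be scripted. The following is only a sketch: the function name scheddmon_to_vocms is hypothetical, and it assumes that /scheddmon/NNN/ always maps to the host vocmsNNN.cern.ch, which is inferred from the single example above and not confirmed for other schedd numbers.

```shell
#!/usr/bin/env bash
# Sketch: turn a cmsweb scheddmon job-log URL into the plain-HTTP vocms URL.
# ASSUMPTION: /scheddmon/NNN/ maps to host vocmsNNN.cern.ch (seen in one example only).
scheddmon_to_vocms() {
  local url="$1"
  local prefix="https://cmsweb.cern.ch/scheddmon/"
  case "$url" in
    "$prefix"*) ;;   # looks like a scheddmon URL, continue
    *) echo "not a cmsweb scheddmon URL: $url" >&2; return 1 ;;
  esac
  local rest="${url#"$prefix"}"   # e.g. 096/cms1266/.../job_out.1.0.txt
  local num="${rest%%/*}"         # schedd number, e.g. 096
  local path="${rest#*/}"         # remainder, e.g. cms1266/.../job_out.1.0.txt
  printf 'http://vocms%s.cern.ch/%s\n' "$num" "$path"
}
```

With the ssh tunnel from the first terminal still open, a log could then be fetched with: curl --socks5 localhost:12345 "$(scheddmon_to_vocms "$JOBLOG_URL")", where JOBLOG_URL holds the https link copied from the dashboard.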

Listing the recent 24h CMS Jobs at CSCS by CLI

This lets you grep for whatever you want except the Job Log URL ; they don't publish it, so you still need the CMS Dashboard for that

for CE in arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch ; do echo NEXT-CE=$CE ; curl --stderr - "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=$CE&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint" ; done
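
To grep repeatedly without querying the dashboard each time, the same request can be saved to one file per CE. This is a sketch only: the function names cms_jobs_url and dump_cms_jobs and the output file naming are hypothetical, while the query string is copied from the one-liner above.

```shell
#!/usr/bin/env bash
# Build the dashboard query URL for one CE (same parameters as the one-liner above).
cms_jobs_url() {
  local ce="$1" site="${2:-T2_CH_CSCS}"
  printf '%s\n' "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=${site}&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=${ce}&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint"
}

# Save the job listing of each CE into OUTDIR/cms-jobs-<CE>.txt (hypothetical naming).
dump_cms_jobs() {
  local outdir="$1" ce
  shift
  mkdir -p "$outdir" || return 1
  for ce in "$@"; do
    curl --silent --show-error "$(cms_jobs_url "$ce")" > "${outdir}/cms-jobs-${ce}.txt"
  done
}
```

For example: dump_cms_jobs /tmp/cms-jobs arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch, followed by grep -l PATTERN /tmp/cms-jobs/*.txt to see which CEs match.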

Fabio's Leaves

{ [20-24] Jun , [11-15] Jul , [25-29] Jul , [8-12] Aug , [22-26] Aug }
During these periods I'll reply to your emails with long delays

UNIBE-LHEP

Operations

stable, no incidents to report

ATLAS specific operations

40% of ATLAS/CH WT, but 67% of CPU time in May (all jobs) - CSCS shows >60% FAILED WT [1] (most of the failures are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
DPM head node migration to SLC6 and ATLAS storage dumps still on hold

HammerCloud report [2]

UNIBE-LHEP online >92% (last month), better than the previous month. There is still room for improvement, but the impact is small since the interruptions are not long enough to cause the site to drain.
UNIBE-ID >99%
UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

Accounting numbers (from scheduler) from last month (May 2016) ( includes ce03/CLOUD )
- WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
Accounting numbers (from ATLAS dashboard) from last month (May 2016)
- CPU h: 1194137
- WC h: 1358408

UNIBE-ID

Smooth operation in general; no outages
A mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate, but more wasted resources.
- Medium-term goal: Move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
As previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
Accounting numbers (from scheduler) from last month for ATLAS:
- CPU h: 135'276
- WC h: 108'001

UNIGE

Xxx
Accounting numbers (from scheduler) from last month

NGI_CH

WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. Long ongoing discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
Heads up: current funding for the minimal NGI_CH operation layer (10% FTE) will end by the end of the year. We will need to identify a solution. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30kCHF including ~10% FTE for operation). Certificates are now used beyond CHIPP alone.

NGI-CH Open Tickets review

120405 for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
117899 for UNIBE-LHEP (ATLAS) On hold (ATLAS request: storage dumps)

A.O.B.

Attendants

CSCS: Dario, Dino, Gianni
CMS: Fabio, Joosep ?
ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
LHCb:
EGI: apologies: Gianfranco (at NorduGrid 2016 conference)

Action items

Item1

Topic revision: r7 - 2016-06-02 - GianniRicciardi