<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
   #uncomment this if you want the page only be viewable by the internal people
   #* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2016-06-02 at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204
%TOC%

---++ Site status
---+++ CSCS
   * Decommissioning of the CREAM CEs is proceeding: currently checking APEL accounting before removing them from GOCDB, to avoid any risk of losing official accounting data
   * Nagios re-installation ongoing
   * Working to bring back accounting data after migration to the new cluster: it should be possible to perform queries in a more flexible way (details upcoming)
   * Downtime scheduled to replace the CPUs with the v4 version on the latest 40 WNs (to be done by Dalco)
__dCache__
   * some tunings and puppet integration on the new storage (SE 23-26)
   * planning puppet integration on the rest of the storage infrastructure
   * IBM DC3500 decommissioned
__GPFS__
   * will apply the security patch (<a target="_blank" href="http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005781">CVE-2016-0392</a>) asap (v 3.5.0.31)
   * soon: move metadata to SAN Flash
   * next: move to Spectrum Scale 4.2.x and evaluate the possibility to enable the Highly-available write cache (<a target="_blank" href="http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.1.1/com.ibm.spectrum.scale.v4r11.adv.doc/bl1adv_hawc.htm">HAWC</a>) on the new (40) nodes

---+++ PSI

[[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

*dCache 2.15 SQL*
   * I've found the time to update my SQL code for Chimera as in dCache 2.15 
      * https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/master ( Chimera as in dCache 2.2 - 2.13 )
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/2.15
   * once you've installed the code you get this /pnfs report out of the box: the /pnfs dirs ordered by size, refreshed every night : 
      * curl http://t3mon.psi.ch/ganglia/PSIT3-custom/v_pnfs_top_dirs.txt 2&gt;/dev/null
   * and you can invite users to delete their unnecessary big dirs by for instance : 
      * uberftp YOUR_SE 'rm -r /pnfs/a/b/c/target_dir'
*dCache 2.15 Derek's utilities*
   * Need to update Derek's https://github.com/dfeich/dcache-shellutils utilities for dCache 2.15
*dCache 2.15 new Storage*
   * During 2016 we have to replace ~200TB net ; I see 3 options : 
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e2700/index.aspx ( cheap / slow / big enough ) _probably this is enough_
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e5600/index.aspx ( expensive / fast / big enough )
      1 4U-90disks http://www.supermicro.com/products/chassis/4U/847/SC847DE2C-R2K04JBOD.cfm ( cheap / fast / bigger ) ; it needs ZFS on Linux
*Debugging the CMS Job Logs*
   * Found a way to allow Miguel and the other CSCS colleagues to browse the CMS Job Logs even if their X509 is unauthorized
   * in general the recent =arcbrisi= jobs are on http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=JobDetailedView&p=1&records=200&activemenu=0&usr=&site=T2_CH_CSCS&ce=arcbrisi.cscs.ch ; each job features a 'Job Detail View' field ; each of them features a JobLog field ; these Job logs are hosted either on a server like http://submit-5.t2.ucsd.edu/.. ( PLAIN HTTP =&gt; NO ISSUES ) or they're hosted at CERN on a server like https://cmsweb.cern.ch/scheddmon/096/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt ( HTTPS asking your X509 =&gt; YOU CAN'T ACCESS THEM )
   * For the latter case open in a 1st terminal :
   * =ssh -D 12345 YOURACCOUNT@lxplus.cern.ch=
   * And in a 2nd terminal *rewrite the https URL as* :
   * <pre>curl --socks5 localhost:12345 http://vocms096.cern.ch/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt</pre>
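The URL rewrite above can be sketched as a small shell helper. This is only a sketch: the function names =rewrite_joblog_url= and =fetch_joblog= are hypothetical, and the mapping rule (the number after =/scheddmon/= becomes the =vocmsNNN.cern.ch= host) is inferred from the single example above, so it may not hold for every scheduler.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a cmsweb scheddmon URL into its plain-HTTP vocms form.
# Assumption (inferred from one example only):
#   https://cmsweb.cern.ch/scheddmon/NNN/<path>  ->  http://vocmsNNN.cern.ch/<path>
# URLs that do not match the pattern are printed unchanged.
rewrite_joblog_url() {
    printf '%s\n' "$1" |
        sed -E 's#^https://cmsweb\.cern\.ch/scheddmon/([0-9]+)/#http://vocms\1.cern.ch/#'
}

# Hypothetical helper: fetch a job log through the SOCKS tunnel.
# Requires the tunnel from the 1st terminal to be up:
#   ssh -D 12345 YOURACCOUNT@lxplus.cern.ch
fetch_joblog() {
    curl --socks5 localhost:12345 "$(rewrite_joblog_url "$1")"
}
```

With the tunnel open, =fetch_joblog 'https://cmsweb.cern.ch/scheddmon/096/cms1266/.../job_out.1.0.txt'= should behave like the =curl= command above; logs already served over plain HTTP pass through unchanged.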
*Listing the recent 24h CMS Jobs at CSCS by CLI*
   * so you can =grep= for whatever you want *except* the Job Log URL :( ; they don't publish it, so you still need the CMS Dashboard
   * <pre>for CE in arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch ; do echo NEXT-CE=$CE ; curl --stderr - "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=$CE&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint" ; done</pre>
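For repeated use, the one-liner above can be split into two small functions. This is a sketch, not a tested tool: the names =jobstatus_url= and =list_cscs_jobs= are hypothetical, and the query string is copied unchanged from the command above, split only at the =ce== parameter.

```shell
#!/bin/sh
# Query string copied from the one-liner above, split around the ce= parameter.
QUERY_HEAD='user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby='
QUERY_TAIL='rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint'

# Hypothetical helper: print the jobstatus2 URL for one CE.
jobstatus_url() {
    printf 'http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?%s&ce=%s&%s\n' \
        "$QUERY_HEAD" "$1" "$QUERY_TAIL"
}

# Hypothetical helper: dump the recent jobs of every CE passed as an argument,
# each block preceded by a NEXT-CE= marker as in the original loop.
list_cscs_jobs() {
    for ce in "$@"; do
        echo "NEXT-CE=$ce"
        curl --stderr - "$(jobstatus_url "$ce")"
    done
}
```

Usage: =list_cscs_jobs arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch= and pipe the output into =grep= as needed.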
*Fabio's Leaves*
   * *{* [20-24] Jun , [11-15] Jul , [25-29] Jul , [8-12] Aug , [22-26] Aug *}*
   * I'll reply to your emails with long delays

---+++ UNIBE-LHEP

*Operations*
   * stable, no incidents to report
*ATLAS specific operations*
   * 40% of ATLAS/CH WT, but 67% of CPU time in May (all jobs) - CSCS shows &gt;60% FAILED WT [1] (most of them are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
   * DPM head node migration to SLC6 and ATLAS storage dumps still on hold
*HammerCloud report [2]*
   * UNIBE-LHEP online &gt;92% (last month). Better than the previous month. Still room for improvement, but the impact is limited since interruptions are not long enough to cause the site to drain.
   * UNIBE-ID &gt;99%
   * UNIBE-LHEP_CLOUD* &lt;90% (lost heartbeat from pilot: some intermittent network instabilities)
[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

   * *Accounting numbers (from scheduler) from last month (May 2016)* ( includes ce03/CLOUD ) 
      * WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
   * *Accounting numbers (from ATLAS dashboard) from last month* (May 2016) 
      * CPU h: 1194137
      * WC h: 1358408
---+++ UNIBE-ID
   * Smooth operation in general; no outages
   * Mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge =&gt; lower failure rate but more wasted resources. 
      * Medium-term goal: Move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
   * As previously announced, 2-day downtime next week: IB recabling (8 =&gt; 16 spine switches); provisioning of 2160 cores (Broadwell)
   * Accounting numbers (from scheduler) from last month for ATLAS: 
      * CPU h: 135'276
      * WC h: 108'001

---+++ UNIGE
   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH
   * WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. Long ongoing discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
   * Heads up: current funding for the minimal NGI_CH operation layer (10% FTE) will end at the end of the year. A solution will need to be identified. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30kCHF including ~10% FTE for operation). Certificates are now used beyond CHIPP as well.

   * *NGI-CH Open Tickets review*
   1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a> for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
   1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a> for UNIBE-LHEP (ATLAS) On hold (ATLAS request- storage dumps)

---++ Other topics
   * Topic1
   * Topic2
Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Dario, Dino, Gianni
   * CMS: Fabio, Joosep ?
   * ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
   * LHCb:
   * EGI: apologies: Gianfranco (at NorduGrid 2016 conference)

---++ Action items
   * Item1