<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-06-02 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS

   * CREAM CE decommissioning proceeding: currently checking APEL accounting before removing the CEs from GOCDB, to avoid any risk of losing official accounting data
   * Nagios re-installation ongoing
   * working to bring back the accounting data after the migration to the new cluster: it should then be possible to run queries in a more flexible way (details upcoming)
   * downtime set to replace the CPUs with the v4 version on the latest 40 WNs (to be done by Dalco)

__dCache__
   * some tuning and Puppet integration on the new storage (SE 23-26)
   * planning Puppet integration on the rest of the storage infrastructure
   * IBM DC3500 decommissioned

__GPFS__
   * will apply the security patch (<a target="_blank" href="http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005781">CVE-2016-0392</a>) asap (v 3.5.0.31)
   * soon: move the metadata to SAN flash
   * next: move to Spectrum Scale 4.2.x and evaluate enabling the Highly-Available Write Cache (<a target="_blank" href="http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.1.1/com.ibm.spectrum.scale.v4r11.adv.doc/bl1adv_hawc.htm">HAWC</a>) on the new (40) nodes
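As a sanity check around the CVE-2016-0392 patch, one can compare the installed GPFS level against the first patched release (3.5.0.31). A minimal sketch, assuming GNU coreutils =sort -V= for version ordering; the =mmdiag= parsing shown in the comment is an assumption and may need adjusting per release:

```shell
#!/bin/sh
# Sketch: decide whether a node still needs the CVE-2016-0392 patch,
# i.e. whether its GPFS level is older than 3.5.0.31.
# Assumes GNU coreutils `sort -V` for version comparison.

gpfs_needs_patch() {
    cur=$1   # installed level, e.g. "3.5.0.30"
    req=$2   # first patched level, e.g. "3.5.0.31"
    # older than required <=> cur != req AND cur sorts first
    [ "$cur" != "$req" ] && \
        [ "$(printf '%s\n' "$cur" "$req" | sort -V | head -n1)" = "$cur" ]
}

# On a real GPFS node the level could be taken e.g. from `mmdiag --version`
# (output format varies by release -- adjust the parsing; this is an assumption):
#   cur=$(mmdiag --version | grep -o '[0-9]\+\(\.[0-9]\+\)\{3\}' | head -n1)

if gpfs_needs_patch "3.5.0.30" "3.5.0.31"; then
    echo "patch required"
fi
```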
---+++ PSI

[[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

*dCache 2.15 SQL*
   * I have found the time to update my SQL code for Chimera as shipped in dCache 2.15
      * https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/master (Chimera as in dCache 2.2 - 2.13)
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/2.15
   * once you have installed the code you get, out of the box, this /pnfs report with the /pnfs dirs ordered by size, refreshed every night:
      * =curl http://t3mon.psi.ch/ganglia/PSIT3-custom/v_pnfs_top_dirs.txt 2>/dev/null=
   * you can then invite users to delete their unnecessarily big dirs, for instance with:
      * =uberftp YOUR_SE 'rm -r /pnfs/a/b/c/target_dir'=

*dCache 2.15 Derek's utilities*
   * Derek's https://github.com/dfeich/dcache-shellutils utilities need to be updated for dCache 2.15

*dCache 2.15 new storage*
   * during 2016 we have to replace ~200 TB net; I see 3 options:
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e2700/index.aspx (cheap / slow / big enough) _probably this is enough_
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e5600/index.aspx (expensive / fast / big enough)
      1 4U-90disks http://www.supermicro.com/products/chassis/4U/847/SC847DE2C-R2K04JBOD.cfm (cheap / fast / bigger); it needs ZFS on Linux

*Debugging the CMS Job Logs*
   * found a way to let Miguel and the other CSCS colleagues browse the CMS Job Logs even if their X509 certificate is not authorized
   * in general the recent =arcbrisi= jobs are on http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=JobDetailedView&p=1&records=200&activemenu=0&usr=&site=T2_CH_CSCS&ce=arcbrisi.cscs.ch ; each job features a 'Job Detail View' field, which in turn features a JobLog field; these job logs are hosted either on a server like http://submit-5.t2.ucsd.edu/..
(plain HTTP => no issues), or they are hosted at CERN on a server like https://cmsweb.cern.ch/scheddmon/096/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt (HTTPS asking for your X509 => you can't access them)
   * for the latter case, open in a 1st terminal:
      * =ssh -D 12345 YOURACCOUNT@lxplus.cern.ch=
   * and in a 2nd terminal *rewrite the https URL as*:
      * <pre>curl --socks5 localhost:12345 http://vocms096.cern.ch/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt</pre>

*Listing the recent 24h CMS Jobs at CSCS by CLI*
   * so you can =grep= whatever you want *except* the Job Log URL :( ; they don't publish it, so for that you still need the CMS Dashboard
   * <pre>for CE in arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch ; do echo NEXT-CE=$CE ; curl --stderr - "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=$CE&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint" ; done</pre>

*Fabio's Leaves*
   * *{* [20-24] Jun, [11-15] Jul, [25-29] Jul, [8-12] Aug, [22-26] Aug *}*
   * I will reply to your emails with long delays

---+++ UNIBE-LHEP

*Operations*
   * stable, no incidents to report

*ATLAS specific operations*
   * 40% of the ATLAS/CH walltime, but 67% of the CPU time in May (all jobs) - CSCS shows >60% failed walltime [1] (most failures are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
   * DPM head node migration to SLC6 and ATLAS storage dumps still on hold

*HammerCloud report [2]*
   * UNIBE-LHEP online >92% (last month). Better than the previous month. Still room for improvement, but the impact is limited since the interruptions are not long enough to cause the site to drain.
   * UNIBE-ID >99%
   * UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

   * *Accounting numbers (from scheduler) from last month (May 2016)* (includes ce03/CLOUD)
      * WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
   * *Accounting numbers (from ATLAS dashboard) from last month* (May 2016)
      * CPU h: 1194137
      * WC h: 1358408

---+++ UNIBE-ID

   * smooth operation in general; no outages
   * mitigated the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate, but more resource waste
   * medium-term goal: move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
   * as previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
   * accounting numbers (from scheduler) from last month for ATLAS:
      * CPU h: 135'276
      * WC h: 108'001

---+++ UNIGE

   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH

   * WLCG plans to retire the requirement for sites to run a site BDII. EGI sees it differently. Long ongoing discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
   * Heads up: the current funding for the minimal NGI_CH operations layer (10% FTE) will end by the end of the year.
A solution will need to be identified. Also open from the end of the year are the EGI fee (hopefully it will go via SWING) and the certificates (~30 kCHF, including ~10% FTE for operation). By now certificates are used not only by CHIPP in the strict sense.

   * *NGI-CH open tickets review*
      1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a> for CSCS (LHCb). Red: "very urgent", last update on 2016-05-11. Reply awaited from the site.
      1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a> for UNIBE-LHEP (ATLAS). On hold (ATLAS request: storage dumps).

---++ Other topics

   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants

   * CSCS: Dario, Dino, Gianni
   * CMS: Fabio, Joosep ?
   * ATLAS: apologies: Gianfranco (at the NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
   * LHCb:
   * EGI: apologies: Gianfranco (at the NorduGrid 2016 conference)

---++ Action items

   * Item1
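Following up on the CMS job-log access trick described in the PSI section above: the cmsweb-to-vocms URL rewrite can be scripted rather than done by hand. A minimal sketch, under the assumption (taken from the single example above, where =scheddmon/096= maps to =vocms096.cern.ch=) that the scheddmon number always maps to the corresponding =vocmsNNN= host:

```shell
#!/bin/sh
# Sketch: rewrite a cmsweb scheddmon job-log URL into its plain-HTTP
# vocms form, so it can be fetched through the lxplus SOCKS proxy.
# Assumption: scheddmon/NNN always maps to host vocmsNNN.cern.ch.

rewrite_cms_log_url() {
    printf '%s\n' "$1" | \
        sed -E 's|^https://cmsweb\.cern\.ch/scheddmon/([0-9]+)/|http://vocms\1.cern.ch/|'
}

# Then fetch through the SOCKS proxy opened in another terminal with
#   ssh -D 12345 YOURACCOUNT@lxplus.cern.ch
# e.g.:  curl --socks5 localhost:12345 "$(rewrite_cms_log_url "$URL")"
rewrite_cms_log_url "https://cmsweb.cern.ch/scheddmon/096/cms1266/job_out.1.0.txt"
```

Non-cmsweb URLs (the plain-HTTP UCSD ones) pass through unchanged, so the helper can be applied blindly to any job-log URL taken from the Dashboard.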
Topic revision: r7 - 2016-06-02 - GianniRicciardi