<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2016-06-02 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS
   * CREAM CE decommissioning is proceeding: we are currently checking APEL accounting before removing the CEs from GOCDB, to avoid any risk of losing official accounting data
   * Nagios re-installation ongoing
   * Working to bring back accounting data after the migration to the new cluster: it should be possible to perform queries in a more flexible way (details upcoming)
   * Downtime set to replace the CPUs with the v4 version on the latest 40 WNs (to be done by Dalco)

__dCache__
   * some tuning and puppet integration on the new storage (SE 23-26)
   * planning puppet integration on the rest of the storage infrastructure
   * IBM DC3500 decommissioned

__GPFS__
   * will apply the security patch (<a target="_blank" href="http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005781">CVE-2016-0392</a>) asap (v 3.5.0.31)
   * soon: move metadata to SAN Flash
   * next: move to Spectrum Scale 4.2.x and evaluate the possibility of enabling the highly-available write cache (<a target="_blank" href="http://www.ibm.com/support/knowledgecenter/en/STXKQY_4.1.1/com.ibm.spectrum.scale.v4r11.adv.doc/bl1adv_hawc.htm">HAWC</a>) on the new (40) nodes

---+++ PSI
[[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]] <br />

*dCache 2.15 SQL*
   * I've found the time to update my SQL code for Chimera as in dCache 2.15
      * https://bitbucket.org/fabio79ch/v_pnfs/wiki/Home
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/master ( Chimera as in dCache 2.2 - 2.13 )
      * https://bitbucket.org/fabio79ch/v_pnfs/branch/2.15
   * once you have installed the code you get this /pnfs report out of the box: the /pnfs dirs ordered by their size, refreshed every night:
      * =curl http://t3mon.psi.ch/ganglia/PSIT3-custom/v_pnfs_top_dirs.txt 2>/dev/null=
   * and you can invite users to delete their unnecessarily big dirs with, for instance:
      * =uberftp YOUR_SE 'rm -r /pnfs/a/b/c/target_dir'=

*dCache 2.15 Derek's utilities*
   * need to update Derek's https://github.com/dfeich/dcache-shellutils utilities for dCache 2.15

*dCache 2.15 new Storage*
   * During 2016 we have to replace ~200TB net; I see 3 options:
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e2700/index.aspx ( cheap / slow / big enough ) _probably this is enough_
      1 4U-60disks http://www.netapp.com/us/products/storage-systems/e5600/index.aspx ( expensive / fast / big enough )
      1 4U-90disks http://www.supermicro.com/products/chassis/4U/847/SC847DE2C-R2K04JBOD.cfm ( cheap / fast / bigger ); it needs ZFS on Linux

*Debugging the CMS Job Logs*
   * Found a way to allow Miguel and the other CSCS colleagues to browse the CMS Job Logs even if their X509 is unauthorized
   * in general the recent =arcbrisi= jobs are on http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=JobDetailedView&p=1&records=200&activemenu=0&usr=&site=T2_CH_CSCS&ce=arcbrisi.cscs.ch ; each job features a 'Job Detail View' field; each of them features a JobLog field; these Job logs are hosted either on a server like http://submit-5.t2.ucsd.edu/..
     ( PLAIN HTTP => NO ISSUES ) or they're hosted at CERN on a server like https://cmsweb.cern.ch/scheddmon/096/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt ( HTTPS asking for your X509 => YOU CAN'T ACCESS THEM )
   * For the latter case open in a 1st terminal:
      * =ssh -D 12345 YOURACCOUNT@lxplus.cern.ch=
   * And in a 2nd terminal *rewrite the https URL as*:
      * <pre>curl --socks5 localhost:12345 http://vocms096.cern.ch/cms1266/160601_105724:gechen_crab_BuToJpsiK_MC_GENOnly_8TeV_Ntuples_v2/job_out.1.0.txt</pre>

*Listing the recent 24h CMS Jobs at CSCS by CLI*
   * so you can =grep= for whatever you want *except* the Job Log URL :( ; they don't publish it, so you still need the CMS DashBoard
   * <pre>for CE in arc01.lcg.cscs.ch arc02.lcg.cscs.ch arc03.lcg.cscs.ch arcbrisi.cscs.ch ; do echo NEXT-CE=$CE ; curl --stderr - "http://dashb-cms-job.cern.ch/dashboard/request.py/jobstatus2?user=&site=T2_CH_CSCS&submissiontool=&application=&activity=&status=&check=&tier=&sortby=&ce=$CE&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=&date1=&date2=&count=0&offset=0&exitcode=&fail=&cat=&len=5000&prettyprint" ; done</pre>

*Fabio's Leaves*
   * *{* [20-24] Jun , [11-15] Jul , [25-29] Jul , [8-12] Aug , [22-26] Aug *}*
   * I'll reply to your emails with long delays

---+++ UNIBE-LHEP

*Operations*
   * stable, no incidents to report

*ATLAS specific operations*
   * 40% of ATLAS/CH WT, but 67% CPUtime in May (all jobs) - CSCS shows >60% FAILED WT [1] (most of them are "SIGTERM from the batch system" and "error in copying the file from job workdir to local SE" - will open an RT ticket to follow up on this)
   * DPM head node migration to SLC6 and ATLAS storage dumps still on hold

*HammerCloud report [2]*
   * UNIBE-LHEP online >92% (last month). Better than the previous month. Still room for improvement, but the impact is small since the interruptions are not long enough to cause the site to drain.
   * UNIBE-ID >99%
   * UNIBE-LHEP_CLOUD* <90% (lost heartbeat from pilot: some intermittent network instabilities)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptionsxml?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=CH-CHIPP-CSCS&resourcetype=All&activities=all&sitesSort=2&sitesCatSort=2&start=2016-05-01&end=2016-05-31&timeRange=daily&granularity=Monthly&generic=0&sortBy=0&series=All&type=gstb

[2] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

   * *Accounting numbers (from scheduler) from last month (May 2016)* ( includes ce03/CLOUD )
      * WC h: 1211030 (ATLAS) - 23599 (t2k.org) - 282 (uboone) - 7 (ops)
   * *Accounting numbers (from ATLAS dashboard) from last month* (May 2016)
      * CPU h: 1194137
      * WC h: 1358408

---+++ UNIBE-ID
   * Smooth operation in general; no outages
   * A mitigation has been set up for the high failure rate of ATLAS jobs (SIGKILL due to h_vmem violation) by increasing the multiplier in submit-job-sge => lower failure rate but more wasted resources.
   * Medium-term goal: move from OG-SGE to Slurm (essentially a matter of user acceptance, not a technical issue)
   * As previously announced, 2-day downtime next week: IB recabling (8 => 16 spine switches); provisioning of 2160 cores (Broadwell)
   * Accounting numbers (from scheduler) from last month for ATLAS:
      * CPU h: 135'276
      * WC h: 108'001

---+++ UNIGE
   * Xxx
   * Accounting numbers (from scheduler) from last month

---+++ NGI_CH
   * WLCG plans to retire the requirement for sites to run a site-bdii. EGI sees it differently. Long ongoing discussion, including a WLCG Task Force assigned to this. Stay tuned, but don't hold your breath :-)
   * Heads up: current funding for the minimal NGI_CH operation layer (10%FTE) will end by end of year.
     Will need to identify a solution. Also open from the end of the year are the EGI fee (hopefully it will go on Swing) and the certificates (~30kCHF including ~10% FTE for operation). Certificates are no longer used strictly by CHIPP alone.
   * *NGI-CH Open Tickets review*
      1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=120405">120405</a> for CSCS (LHCb) Red: "very urgent", last update on 2016-05-11. Reply awaited from site.
      1 <a target="_blank" href="https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899">117899</a> for UNIBE-LHEP (ATLAS) On hold (ATLAS request - storage dumps)

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Dario, Dino, Gianni
   * CMS: Fabio, Joosep ?
   * ATLAS: apologies: Gianfranco (at NorduGrid 2016 conference), Nico Färber (UNIBE-ID)
   * LHCb:
   * EGI: apologies: Gianfranco (at NorduGrid 2016 conference)

---++ Action items
   * Item1
Topic revision: r7 - 2016-06-02 - GianniRicciardi
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.