<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable by internal people only
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-11-11 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS

---+++++ *Quick report on HEPiX Fall 2016 (first time at HEPiX)*
   * Around 100 participants
   * __Running HEP workloads on the NERSC HPC (presentation by Tony Quan)__
      * Specs:
         * Cori phase 1 (1630 Haswell nodes)
         * Cori phase 2 (9399 Knights Landing nodes)
      * Different CVMFS approach, not working for us (we are using it natively on the nodes)
   * __A lot of site reports (2 full days)__
      * GPFS, Hadoop and Dropbox-like services (CERNBox at CERN) used by many sites as storage solutions
      * Lustre widely used
      * Sites starting HPC and HTC integration activities
      * OpenStack and Docker widely used
      * Many monitoring solutions (infrastructure, HW, services, etc.)
      * Preparations for the migration to IPv6
      * WAN connectivity upgrades at many sites
   * __Storage__
      * CephFS presentations from Australia (geo-distributed)
      * HA dCache presentation
   * __Computing & Batch__
      * HTCondor (Slurm support; improved OpenStack, AWS and container integration)
   * __Facilities__
      * CERN OpenCompute project, still not performing so well (too early)
      * New data centers at CERN (Green Cube, 2020)
   * __Basic IT__
      * Puppet at many sites, also considering migration to v4
      * Many ELK stacks deployed
   * __Cloud__
      * Container orchestration at RAL
      * NERSC HPC resources: Shifter (now open source), Burst Buffer (dynamic allocation of high-performance filesystems)

---+++++ *System*
   * Closing the site for CVE-2016-5195 on 1024-1026: we waited for the patched kernel to be released, and in the meantime worked on the new scratch FS
   * All machines patched as soon as the new kernel was available
   * Job slots re-enabled gradually after the maintenance
   * New scratch FS mounted on arc[02,03] while the old one was put on drain; arc01 is still using the original scratch FS
   * Working on the CMS VO box

---+++++ *Storage*
   * *dCache*
      * Production: stable, updated to the latest 2.10 patch
      * PreProduction: updated to 2.13; working on some gfal-copy problems
      * Production update scheduled for the first week of December 2016
   * *GPFS*
      * Performance issues on Krusty02, now restored
      * arc01 jobs -> phoenix_scratch
      * arc02-03 jobs -> new_phoenix_scratch (DDN SFA12K), tested up to 6 GB/s, limited by the number of servers (4)
      * During the dCache maintenance we will move to 8 servers and review the results

---+++ PSI
   * Converting our 6 old UIs into 6 WNs
      * each featuring [ 100GB RAM, 32 CPU cores, 4*1TB 7.2k disks, 2*1GbE ]
   * Installed a [[http://www.netapp.com/us/products/storage-systems/e2700/e2700-tech-specs.aspx][NetApp E2760]] with 8 SAS 12Gb/s ports
      * [ 52*6TB disks + 8*400GB SSD ( cache ) ], 2 SAS-based RAID controllers
      * Final net capacity ~200TB, to be used for *dCache*
      * The 52*6TB DDP pool can tolerate 4 broken disks
      * [[https://www.netapp.com/us/media/ds-3395.pdf][NetApp SANtricity SSD Cache]]
      * [[https://www.netapp.com/us/media/ds-3309.pdf][NetApp SANtricity Dynamic Disk Pools ( DDP )]]
      * This is an epochal change in how we run RAIDs
      * [[CmsTier3/SGIIS5500andE5460andE2760#NetApp_E2760_312TB_raw]]
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP
   * Routine operation up to the shutdown for CVE-2016-5195
   * Downtime was ill-declared (by me), so the site was not taken offline; this had an impact on the measured efficiency (blackhole effect as well)
   * Infrastructure intervention during and following the downtime; running at reduced capacity for several days
   * Firewall issue for ce04 (cloud) following the downtime: unavailable for a couple of weeks
   * Preparing for the campus-wide power cut on 29-30 Nov
   * *HammerCloud status:* http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=562#time=custom&start_date=2016-10-01&end_date=2016-10-31&values=false&spline=false&debug=false&resample=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2,CSCS-LCG2_MCORE,UNIBE-LHEP,UNIBE-LHEP_MCORE,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE
   * *Accounting numbers (from scheduler)* from last month (core-hours, October 2016): ATLAS: 933809; T2K: 10227; OPS: 31
   * *Accounting numbers from ATLAS dashboard* from last month (core-hours, October 2016) [1],[2]: CSCS / UNIBE 57% / 43% - 1575861 / 1185039 (reduced capacity at UNIBE after the downtime)
   * *Efficiency WT ok/fail* [3]: CSCS/UNIBE 69.71/53.58 (bad downtime for UNIBE)
   * *CPU/WT efficiency* [4]: CSCS/UNIBE 0.53/0.72 (CSCS recovers following the downtime and GPFS fix)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=ewa

[2] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=wab

[3] http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&sortBy=0&granularity=8%20Hours&generic=0&series=All&type=ebwc

[4] http://dashb-atlas-job.cern.ch/dashboard/request.py/efficiency_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=eal

---+++ UNIBE-ID
   * Xxx

---+++ UNIGE
   * Operations
      * Old User Interfaces (UIs) with SLC5 moved to the batch system as Worker Nodes (16 cores x 3 old UIs = 48 cores)
      * Currently, UniGe-DPNC has around 800 cores in the batch system, serving local users and ATLAS Grid production
      * Some accounting issues found when checking the ATLAS dashboard
      * In general: running smoothly, with usage of the cluster by local DPNC users and ATLAS Grid production increasing over time
   * Storage
      * Getting short of space due to other DPNC local groups using the Grid storage; need to apply some cleaning of old data
      * ATLAS DDM blacklist for the [[http://atlas-agis.cern.ch/agis/ddmblacklisting/list/][TRIG-DAQ]] SpaceToken, although there is free space
         * Probably due to the reduction of space for the ATLASGROUPDISK SpaceToken, since I moved some space.
         * I should check it out
         * Currently, decreased from 25 TB to 20 TB
   * Accounting:
      * [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20161111/g07.201610.log][Accounting numbers (from scheduler) from last month]]

---+++ NGI_CH
   * Funding for the NGI_CH liaison roles (operations manager, security officer, etc.) runs out by the end of the year.
   * Possible scenario: 15k/y provided by the CHIPP CB institutes; Bern, via LHEP or the Scientific IT Support unit, to provide the service (as now).
   * Any alternative proposal: please reply to the e-mail thread.
   * NGI_CH open tickets review: https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&supportunit=NGI_CH&su_hierarchy=0&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=06+May+2014&to_date=07+May+2014&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21
      * AFS related: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124818][124818]] (PSI) in progress; [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124815][124815]] (UZH) contacted UZH to check whether the site is obsolete -> could deactivate it in GOCDB
      * ATLAS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124719][124719]] (squid down) needs a restart on atlas01. DINO: squid started.
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124518][124518]] (higher than normal failure rate at UBELIX). Main cause of failure fixed; dealing with some job timeouts now
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps) on hold
      * CMS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124714][124714]] (jobs not running) fixed?
      * Accounting:
         * CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=123765][123765]] (CREAM accounting): needs action from CSCS
         * UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124320][124320]] (not publishing): actions carried out, must check back on the status

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS:
   * CMS: Fabio
   * ATLAS: Gianfranco (apologies), Luis
   * LHCb:
   * EGI:

---++ Action items
   * Item1
Topic revision: r10 - 2016-11-11 - DinoConciatore