<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-11-11 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS

---+++++ *Quick report on HEPiX Fall 2016 (first time at HEPiX)*
   * Around 100 participants
   * __Running HEP workloads on the NERSC HPC (Tony Quan's presentation)__
      * Specs:
         * Cori phase 1 (1630 Haswell nodes)
         * Cori phase 2 (9399 Knights Landing nodes)
      * Different CVMFS approach, not working for us
        (we are using it natively on the nodes)
   * __A lot of site reports (2 full days)__
      * GPFS, Hadoop, Dropbox (CERNBox at CERN) used by many sites as storage solutions
      * Lustre widely used
      * Sites starting HPC and HTC integration activities
      * OpenStack and Docker widely used
      * Many monitoring solutions (infrastructure, HW, services, etc.)
      * Preparations for migration to IPv6
      * WAN connectivity upgrades at many sites
   * __Storage__
      * CephFS presentations by Australia (geo-distributed)
      * HA dCache presentation
   * __Computing & Batch__
      * HTCondor (Slurm support, improved OpenStack, AWS, containers)
   * __Facilities__
      * CERN OpenCompute project, still not performing so well (too early)
      * New data centres at CERN (Green Cube, 2020)
   * __Basic IT__
      * Puppet at many sites, also considering migration to v4
      * Many ELK stacks deployed
   * __Cloud__
      * Container orchestration at RAL
      * NERSC HPC resources: Shifter (now open source), Burst Buffer (dynamic allocation of high-performance filesystems)

---+++++ *System*
   * Closing the site for CVE-2016-5195 on 1024-1026: we waited for the patched kernel to be released and at the same time we had been working on the new scratch FS
   * All machines patched as soon as the new kernel was available
   * Job slots re-enabled gradually after the maintenance
   * New scratch FS mounted on arc[02,03] while the old one was put in drain; arc01 still using the original scratch FS
   * Working on the CMS VO box

---+++++ *Storage*
*dCache*
   * Production: stable,
     updated to the latest 2.10 patch
   * PreProduction: updated to 2.13; working on some gfal-copy problems
   * Production update scheduled for the first week of December 2016
*GPFS*
   * Performance issues on Krusty02, now restored
   * arc01 jobs -> phoenix_scratch
   * arc02-03 jobs -> new_phoenix_scratch (DDN SFA12K), tested up to 6 GB/s, limited by the number of servers (4)
   * During the dCache maintenance we will move to 8 servers and review the results

---+++ PSI
   * Converting our 6 old UIs into 6 WNs
      * each featuring [ 100GB RAM, 32 CPU cores, 4*1TB 7.2k disks, 2*1GbE ]
   * Installed an 8-port 12Gb/s SAS [[http://www.netapp.com/us/products/storage-systems/e2700/e2700-tech-specs.aspx][NetApp E2760]]
      * [ 52*6TB disks + 8*400GB SSD (cache) ], 2 SAS-based RAID controllers
      * Final net capacity ~200TB, to be used for *dCache*
      * The 52*6TB DDP pool can tolerate 4 broken disks
      * [[https://www.netapp.com/us/media/ds-3395.pdf][NetApp SANtricity SSD Cache]]
      * [[https://www.netapp.com/us/media/ds-3309.pdf][NetApp SANtricity Dynamic Disk Pools (DDP)]]
      * This is an epochal change in how we run RAIDs
      * [[CmsTier3/SGIIS5500andE5460andE2760#NetApp_E2760_312TB_raw]]
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP
   * Routine operation up to the shutdown for CVE-2016-5195
   * The downtime was ill-declared (by me), so the site was not taken offline; this had an impact on the measured efficiency (blackhole too)
   * Infrastructure intervention during and following the downtime; running at reduced capacity for several days
   * Firewall issue on ce04 (cloud) following the downtime: unavailable for a couple of weeks
   * Preparing for the campus-wide power cut on 29-30 Nov
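Both CSCS and UNIBE-LHEP took downtime for CVE-2016-5195 ("Dirty COW") and re-opened once patched kernels were deployed. A minimal sketch of the kind of node check this implies, in Python; the reference kernel release below is an assumption for SL6-era worker nodes and must be verified against the distribution's security advisory:

```python
# Hedged sketch: decide whether a node's running kernel is at least the
# release that (we assume) shipped the CVE-2016-5195 fix for SL6.
# The FIXED value is an assumption -- check the vendor advisory.
import platform
import re

FIXED = "2.6.32-642.6.2"  # assumed SL6 fixed kernel (verify against advisory)

def kernel_tuple(release):
    """Turn a release string like '2.6.32-642.6.2.el6.x86_64'
    into a tuple of integers that compares in version order."""
    return tuple(int(n) for n in re.findall(r"\d+", release))

def is_patched(release, fixed=FIXED):
    """True if `release` is at or above the assumed fixed kernel."""
    return kernel_tuple(release) >= kernel_tuple(fixed)

if __name__ == "__main__":
    running = platform.release()
    state = "patched" if is_patched(running) else "VULNERABLE: drain and reboot"
    print("%s -> %s" % (running, state))
```

Run on a worker node, this prints the running release and whether it is at or above the assumed fixed version; job slots would only be re-enabled on nodes that pass.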
   * *HammerCloud status:* http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=562#time=custom&start_date=2016-10-01&end_date=2016-10-31&values=false&spline=false&debug=false&resample=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2,CSCS-LCG2_MCORE,UNIBE-LHEP,UNIBE-LHEP_MCORE,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE
   * *Accounting numbers (from scheduler)* from last month (core-hours, October 2016): ATLAS: 933809; T2K: 10227; OPS: 31
   * *Accounting numbers from the ATLAS dashboard* from last month (core-hours, October 2016) [1],[2]: CSCS / UNIBE 57% / 43% - 1575861 / 1185039 (reduced capacity at UNIBE after the downtime)
   * *Efficiency WT ok/fail* [3]: CSCS/UNIBE 69.71/53.58 (bad downtime for UNIBE)
   * *CPU/WT efficiency* [4]: CSCS/UNIBE 0.53/0.72 (CSCS recovers following the downtime and the GPFS fix)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=ewa
[2]
    http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=wab
[3] http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&sortBy=0&granularity=8%20Hours&generic=0&series=All&type=ebwc
[4] http://dashb-atlas-job.cern.ch/dashboard/request.py/efficiency_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=eal

---+++ UNIBE-ID
   * Xxx

---+++ UNIGE
   * Operations
      * Old User Interfaces (UIs) with SLC5 moved to the batch system as Worker Nodes (16 cores x 3 old UIs = 48 cores)
      * Currently, UniGe-DPNC has around 800 cores in the batch system for local users and ATLAS Grid production
      * Some issues with accounting, found by checking the ATLAS dashboard
      * In general: running smoothly, with usage of the cluster by local DPNC users and ATLAS Grid production increasing over time
   * Storage
      * Getting short of space because other DPNC local groups are using the Grid storage; need to apply some cleaning of old data
      * ATLAS DDM blacklist for the [[http://atlas-agis.cern.ch/agis/ddmblacklisting/list/][TRIG-DAQ]] SpaceToken, although there is free space
         * Probably due to the reduction of space for the ATLASGROUPDISK SpaceToken, since I moved some space;
           I should check it out
         * Currently, decreased from 25 TB to 20 TB
   * Accounting:
      * [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20161111/g07.201610.log][Accounting numbers (from scheduler) from last month]]

---+++ NGI_CH
   * Funding for the NGI_CH liaison roles (operations manager, security officer, etc.) runs out by the end of the year
   * Possible scenario: 15k/y provided by the CHIPP CB institutes; Bern, via LHEP or the Scientific IT Support unit, to provide the service (as now)
   * Any alternative proposal: please reply to the e-mail thread
   * NGI_CH open tickets review: https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&supportunit=NGI_CH&su_hierarchy=0&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=06+May+2014&to_date=07+May+2014&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21
      * AFS related: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124818][124818]] (PSI) in progress; [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124815][124815]] (UZH) contacted UZH to check if the site is obsolete -> could deactivate it in GOCDB
      * ATLAS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124719][124719]] (squid down) needs a restart on atlas01. DINO: squid started.
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124518][124518]] (higher than normal failure rate at Ubelix): main cause of failure fixed, dealing with some job timeouts now
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps) on hold
      * CMS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124714][124714]] (jobs not running) fixed?
      * Accounting:
         * CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=123765][123765]] (CREAM accounting): needs action from CSCS
         * UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124320][124320]] (not publishing): actions carried out, must check back on the status

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS:
   * CMS: Fabio
   * ATLAS: Gianfranco: apologies, Luis
   * LHCb:
   * EGI:

---++ Action items
   * Item1
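The CSCS/UNIBE 57% / 43% ATLAS dashboard split reported in the UNIBE-LHEP section can be re-derived from the quoted core-hour totals. A quick sanity-check sketch (figures taken from these minutes; whole-percent rounding is an assumed convention):

```python
# Re-derive the CSCS / UNIBE share of October 2016 ATLAS core-hours
# from the totals quoted in the minutes.
core_hours = {"CSCS-LCG2": 1575861, "UNIBE-LHEP": 1185039}
total = sum(core_hours.values())  # 2760900 core-hours in total

# Percentage share per site, rounded to a whole percent.
shares = {site: round(100.0 * ch / total) for site, ch in core_hours.items()}
print(total, shares)
```

This reproduces the 57 / 43 split quoted from the dashboard, confirming the two figures are mutually consistent.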
---++ Topic attachments
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| g07.201610.log | 1.1 K | 2016-11-11 - 12:47 | LuisMarch | UniGe-DPNC accounting - October 2016 |
Topic revision: r10 - 2016-11-11 - DinoConciatore