<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2016-11-11 at 14:00

   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
   * *External link*: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)
   * *Switch Vidyo SIP IP*: 137.138.248.204

%TOC%

---++ Site status

---+++ CSCS

---+++++ *Quick report on HEPiX Fall 2016 (first time at HEPiX)*
   * Around 100 participants
   * __Running HEP workloads on the NERSC HPC (Tony Quan's presentation)__
      * Specs:
         * Cori phase 1 (1630 Haswell nodes)
         * Cori phase 2 (9399 Knights Landing nodes)
      * Different CVMFS approach, not working for us
        (we are using it natively on the nodes)
   * __A lot of site reports (2 full days)__
      * GPFS, Hadoop, Dropbox (CERNBox at CERN) used by many sites as storage solutions
      * Lustre widely used
      * Sites starting HPC and HTC integration activities
      * OpenStack and Docker widely used
      * Many monitoring solutions (infrastructure, HW, services, etc.)
      * Preparations for migration to IPv6
      * WAN connectivity upgrades at many sites
   * __Storage__
      * CephFS presentations by Australia (geo-distributed)
      * HA dCache presentation
   * __Computing & Batch__
      * HTCondor (Slurm support, improved OpenStack, AWS, containers)
   * __Facilities__
      * CERN OpenCompute project, still not performing so well (too early)
      * New data centres at CERN (Green Cube, 2020)
   * __Basic IT__
      * Puppet at many sites, also considering migration to v4
      * Many ELK stacks deployed
   * __Cloud__
      * Container orchestration at RAL
      * NERSC HPC resources: Shifter (now open source), Burst Buffer (dynamic allocation of high-performance filesystems)

---+++++ *System*
   * Closing the site for CVE-2016-5195 on 1024-1026: we waited for the patched kernel to be released and at the same time we had been working on the new scratch FS
   * All machines patched as soon as the new kernel was available
   * Job slots re-enabled gradually after the maintenance
   * New scratch FS mounted on arc[02,03] while the old one was put in drain; arc01 still using the original scratch FS
   * Working on the CMS VO box

---+++++ *Storage*
*dCache*
   * Production: stable,
     updated to the latest 2.10 patch
   * PreProduction: updated to 2.13; working on some gfal-copy problems
   * Production update scheduled for the first week of December 2016
*GPFS*
   * Performance issues on Krusty02, now restored
   * arc01 jobs -> phoenix_scratch
   * arc02-03 jobs -> new_phoenix_scratch (DDN SFA12K), tested up to 6 GB/s, limited by the number of servers (4)
   * During the dCache maintenance we will move to 8 servers and review the results

---+++ PSI
   * Converting our 6 old UIs into 6 WNs
      * each featuring [ 100GB RAM, 32 CPU cores, 4*1TB 7.2k disks, 2*1GbE ]
   * Installed an 8-port 12Gb/s SAS [[http://www.netapp.com/us/products/storage-systems/e2700/e2700-tech-specs.aspx][NetApp E2760]]
      * [ 52*6TB disks + 8*400GB SSD (cache) ], 2 SAS-based RAID controllers
      * Final net capacity ~200TB, to be used for *dCache*
      * The 52*6TB DDP pool can tolerate 4 broken disks
      * [[https://www.netapp.com/us/media/ds-3395.pdf][NetApp SANtricity SSD Cache]]
      * [[https://www.netapp.com/us/media/ds-3309.pdf][NetApp SANtricity Dynamic Disk Pools (DDP)]]
      * This is an epochal change in how we run RAIDs
      * [[CmsTier3/SGIIS5500andE5460andE2760#NetApp_E2760_312TB_raw]]
   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP
   * Routine operation up to the shutdown for CVE-2016-5195
   * The downtime was ill-declared (by me), so the site was not taken offline; this had an impact on the measured efficiency (blackhole too)
   * Infrastructure intervention during and following the downtime; running at reduced capacity for several days
   * Firewall issue on ce04 (cloud) following the downtime: unavailable for a couple of weeks
   * Preparing for the campus-wide power cut on 29-30 Nov
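Both CSCS and UNIBE-LHEP took downtime for CVE-2016-5195 ("Dirty COW") and re-opened once patched kernels were deployed. A minimal sketch of the kind of node check this implies, in Python; the reference kernel release below is an assumption for SL6-era worker nodes and must be verified against the distribution's security advisory:

```python
# Hedged sketch: decide whether a node's running kernel is at least the
# release that (we assume) shipped the CVE-2016-5195 fix for SL6.
# The FIXED value is an assumption -- check the vendor advisory.
import platform
import re

FIXED = "2.6.32-642.6.2"  # assumed SL6 fixed kernel (verify against advisory)

def kernel_tuple(release):
    """Turn a release string like '2.6.32-642.6.2.el6.x86_64'
    into a tuple of integers that compares in version order."""
    return tuple(int(n) for n in re.findall(r"\d+", release))

def is_patched(release, fixed=FIXED):
    """True if `release` is at or above the assumed fixed kernel."""
    return kernel_tuple(release) >= kernel_tuple(fixed)

if __name__ == "__main__":
    running = platform.release()
    state = "patched" if is_patched(running) else "VULNERABLE: drain and reboot"
    print("%s -> %s" % (running, state))
```

Run on a worker node, this prints the running release and whether it is at or above the assumed fixed version; job slots would only be re-enabled on nodes that pass.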
   * *HammerCloud status:* http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=562#time=custom&start_date=2016-10-01&end_date=2016-10-31&values=false&spline=false&debug=false&resample=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2,CSCS-LCG2_MCORE,UNIBE-LHEP,UNIBE-LHEP_MCORE,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE
   * *Accounting numbers (from scheduler)* from last month (core-hours, October 2016): ATLAS: 933809; T2K: 10227; OPS: 31
   * *Accounting numbers from the ATLAS dashboard* from last month (core-hours, October 2016) [1],[2]: CSCS / UNIBE 57% / 43% - 1575861 / 1185039 (reduced capacity at UNIBE after the downtime)
   * *Efficiency WT ok/fail* [3]: CSCS/UNIBE 69.71/53.58 (bad downtime for UNIBE)
   * *CPU/WT efficiency* [4]: CSCS/UNIBE 0.53/0.72 (CSCS recovers following the downtime and the GPFS fix)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=ewa
[2]
    http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=wab
[3] http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&sortBy=0&granularity=8%20Hours&generic=0&series=All&type=ebwc
[4] http://dashb-atlas-job.cern.ch/dashboard/request.py/efficiency_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=eal

---+++ UNIBE-ID
   * Xxx

---+++ UNIGE
   * Operations
      * Old User Interfaces (UIs) with SLC5 moved to the batch system as Worker Nodes (16 cores x 3 old UIs = 48 cores)
      * Currently, UniGe-DPNC has around 800 cores in the batch system for local users and ATLAS Grid production
      * Some issues with accounting, found by checking the ATLAS dashboard
      * In general: running smoothly, with usage of the cluster by local DPNC users and ATLAS Grid production increasing over time
   * Storage
      * Getting short of space because other DPNC local groups are using the Grid storage; need to apply some cleaning of old data
      * ATLAS DDM blacklist for the [[http://atlas-agis.cern.ch/agis/ddmblacklisting/list/][TRIG-DAQ]] SpaceToken, although there is free space
         * Probably due to the reduction of space for the ATLASGROUPDISK SpaceToken, since I moved some space;
           I should check it out
         * Currently, decreased from 25 TB to 20 TB
   * Accounting:
      * [[https://wiki.chipp.ch/twiki/pub/LCGTier2/MeetingSwissGridOperations20161111/g07.201610.log][Accounting numbers (from scheduler) from last month]]

---+++ NGI_CH
   * Funding for the NGI_CH liaison roles (operations manager, security officer, etc.) runs out by the end of the year
   * Possible scenario: 15k/y provided by the CHIPP CB institutes; Bern, via LHEP or the Scientific IT Support unit, to provide the service (as now)
   * Any alternative proposal: please reply to the e-mail thread
   * NGI_CH open tickets review: https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&supportunit=NGI_CH&su_hierarchy=0&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=06+May+2014&to_date=07+May+2014&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21
      * AFS related: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124818][124818]] (PSI) in progress; [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124815][124815]] (UZH) contacted UZH to check if the site is obsolete -> could deactivate it in GOCDB
      * ATLAS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124719][124719]] (squid down) needs a restart on atlas01. DINO: squid started.
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124518][124518]] (higher than normal failure rate at Ubelix): main cause of failure fixed, dealing with some job timeouts now
      * ATLAS UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=117899][117899]] (storage dumps) on hold
      * CMS CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124714][124714]] (jobs not running) fixed?
      * Accounting:
         * CSCS: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=123765][123765]] (CREAM accounting): needs action from CSCS
         * UNIBE: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=124320][124320]] (not publishing): actions carried out, must check back on the status

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS:
   * CMS: Fabio
   * ATLAS: Gianfranco: apologies, Luis
   * LHCb:
   * EGI:

---++ Action items
   * Item1
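The CSCS/UNIBE 57% / 43% ATLAS dashboard split reported in the UNIBE-LHEP section can be re-derived from the quoted core-hour totals. A quick sanity-check sketch (figures taken from these minutes; whole-percent rounding is an assumed convention):

```python
# Re-derive the CSCS / UNIBE share of October 2016 ATLAS core-hours
# from the totals quoted in the minutes.
core_hours = {"CSCS-LCG2": 1575861, "UNIBE-LHEP": 1185039}
total = sum(core_hours.values())  # 2760900 core-hours in total

# Percentage share per site, rounded to a whole percent.
shares = {site: round(100.0 * ch / total) for site, ch in core_hours.items()}
print(total, shares)
```

This reproduces the 57 / 43 split quoted from the dashboard, confirming the two figures are mutually consistent.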
---++ Topic attachments
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| g07.201610.log | 1.1 K | 2016-11-11 - 12:47 | LuisMarch | UniGe-DPNC accounting - October 2016 |
Topic revision: r10 - 2016-11-11 - DinoConciatore