
Swiss Grid Operations Meeting on 2016-11-11 at 14:00

Site status

CSCS

Quick report on HEPiX Fall 2016 (first time at HEPiX)
  • Around 100 participants
  • Running HEP Workloads on the NERSC HPC (Tony Quan presentation)
    • Specs:
      • Cori Phase 1 (1630 Haswell nodes)
      • Cori Phase 2 (9399 Knights Landing nodes)
    • Different CVMFS approach, not applicable for us (we run it natively on the nodes; see the probe sketch after this list)
  • A lot of site reports (2 full days)
    • GPFS, Hadoop, Dropbox (CERNBox at CERN) used by many sites as storage solutions
    • Lustre widely used
    • Starting HPC and HTC integration activities
    • OpenStack and Docker widely used
    • Many monitoring solutions (infrastructure, HW, services, etc.)
    • Preparations for migration to IPv6
    • WAN connectivity upgrade in many sites
  • Storage
    • CephFS presentations by Australia (geo distributed)
    • HA dCache presentation
  • Computing & Batch
    • HTCondor (Slurm support, improved OpenStack/AWS support, containers)
  • Facilities
    • CERN OpenCompute project, not yet performing well (still too early to judge)
    • New Data Centers at CERN (Green Cube 2020)
  • Basic IT
    • Puppet used at many sites, several considering migration to v4
    • ELK stacks deployed at many sites
  • Cloud
    • Container orchestration at RAL
    • NERSC HPC resources: Shifter (now open-source), Burst buffer (dynamic allocation of high-performance filesystems)
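
A side note on the CVMFS item above: a minimal sketch, assuming a standard native CVMFS client installation, of how one might verify that the repositories are mounted and healthy on a worker node. The repository list is only an example, not the exact set configured at CSCS.

    import subprocess

    # Example repositories (assumption); adjust to whatever is configured on the node.
    REPOS = ["atlas.cern.ch", "cms.cern.ch", "lhcb.cern.ch"]

    def probe(repo):
        # "cvmfs_config probe <repo>" mounts the repository if needed and reports OK/FAILED;
        # the exit code reflects the result.
        return subprocess.call(["cvmfs_config", "probe", repo]) == 0

    if __name__ == "__main__":
        for repo in REPOS:
            print(repo, "OK" if probe(repo) else "FAILED")
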
System

  • site closed for CVE-2016-5195 ("Dirty COW") on 24-26 October: we waited for the patched kernel to be released and in the meantime kept working on the new scratch FS
  • all machines patched as soon as the new kernel was available (see the verification sketch after this list)
  • job slots re-enabled gradually after the maintenance
  • new scratch FS mounted on arc[02,03] while the old one was put on drain; arc01 is still using the original scratch FS
  • working on the CMS VO box
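
A minimal sketch of the kind of check one could run to confirm that a node boots a kernel containing the CVE-2016-5195 fix. The threshold version is an EL6-style example, not necessarily the exact kernel deployed at CSCS; take the authoritative value from the distribution's security advisory.

    import re
    import subprocess

    # Example threshold (assumption); replace with the fixed kernel version
    # listed in your distribution's CVE-2016-5195 advisory.
    MIN_FIXED = "2.6.32-642.6.2.el6"

    def version_key(release):
        # Strip distro tag / architecture, then compare the numeric fields.
        core = re.sub(r"\.(el\d+|x86_64|i686|noarch).*$", "", release)
        return tuple(int(x) for x in re.findall(r"\d+", core))

    def kernel_is_patched(min_fixed=MIN_FIXED):
        running = subprocess.check_output(["uname", "-r"]).decode().strip()
        return version_key(running) >= version_key(min_fixed), running

    if __name__ == "__main__":
        ok, running = kernel_is_patched()
        print("%s -> %s" % (running, "patched" if ok else "VULNERABLE"))
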
Storage


dCache

  • Production: stable, updated to the latest 2.10 patch
  • PreProduction: updated to 2.13; working on some gfal-copy problems (see the test sketch after this list).
  • Production update scheduled for the first week of December 2016
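
For the gfal-copy problems mentioned above, a minimal smoke-test sketch. The destination URL is a placeholder, not the real preproduction endpoint; it assumes the gfal-copy client (gfal2-util) is installed and a valid VOMS proxy is in place.

    import subprocess
    import tempfile

    # Placeholder destination (assumption); point it at the preproduction dCache door
    # and a path writable by your proxy.
    DST = "srm://preprod-se.example.ch/pnfs/example.ch/data/ops/gfal-smoke-test"

    def smoke_test(dst=DST):
        # Create a small local source file to copy.
        with tempfile.NamedTemporaryFile(prefix="gfal-smoke-", suffix=".txt", delete=False) as f:
            f.write(b"gfal-copy smoke test\n")
            src = f.name
        # gfal-copy exits non-zero on failure; -f overwrites an existing destination.
        result = subprocess.run(["gfal-copy", "-f", "file://" + src, dst],
                                capture_output=True, text=True)
        print(result.stdout or result.stderr)
        return result.returncode == 0

    if __name__ == "__main__":
        print("copy OK" if smoke_test() else "copy FAILED")
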
GPFS

  • Performance issues on Krusty02; performance now restored
  • Arc01 jobs -> phoenix_scratch
  • Arc02-03 jobs -> new_phoenix_scratch (DDN SFA12K), tested up to 6 GB/s, limited by the number of servers (4)
  • During the dCache maintenance we will move to 8 servers and review the results (see the estimate below)
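
The estimate referred to above is simple scaling arithmetic; it assumes aggregate scratch bandwidth grows roughly linearly with the number of GPFS servers until the DDN SFA12K backend saturates, which is exactly what the post-maintenance tests should verify.

    # Measured: 6 GB/s aggregate with 4 servers (from the bullet above).
    measured_gb_s = 6.0
    servers_now = 4
    servers_planned = 8

    per_server = measured_gb_s / servers_now      # ~1.5 GB/s per server
    projected = per_server * servers_planned      # ~12 GB/s if scaling stays linear

    print("per server: %.1f GB/s" % per_server)
    print("projected with %d servers: %.1f GB/s (upper bound, before backend limits)"
          % (servers_planned, projected))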

PSI

UNIBE-LHEP

  • Routine operation up to shutdown for CVE-2016-5195.
  • The downtime was incorrectly declared (by me), so the site was not taken offline; this affected the measured efficiency (black-hole effect too).
  • Infrastructure intervention during and following the downtime, running at reduced capacity for several days.
  • Firewall issue for ce04 (cloud) following the downtime: unavailable for a couple of weeks
  • Preparing for campus-wide power cut on 29-30 Nov.

  • HammerCloud status:
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=562#time=custom&start_date=2016-10-01&end_date=2016-10-31&values=false&spline=false&debug=false&resample=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2,CSCS-LCG2_MCORE,UNIBE-LHEP,UNIBE-LHEP_MCORE,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE
  • Accounting numbers (from scheduler) from last month (core-hours October 2016):
ATLAS: 933809; T2K: 10227; OPS: 31


  • Accounting numbers from ATLAS dashboard from last month (core-hours October 2016) [1],[2]:
CSCS / UNIBE 57% / 43% - 1575861 / 1185039 (reduced capacity at UNIBE after downtime)
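
For reference, the 57% / 43% split quoted above follows directly from the dashboard core-hour totals:

    # CSCS-LCG2 and UNIBE-LHEP core-hours, October 2016 (ATLAS dashboard, [1]).
    cscs, unibe = 1575861, 1185039
    total = cscs + unibe
    print("CSCS  %.0f%% (%d core-hours)" % (100.0 * cscs / total, cscs))
    print("UNIBE %.0f%% (%d core-hours)" % (100.0 * unibe / total, unibe))
    # -> CSCS 57%, UNIBE 43%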

  • Efficiency WT ok/fail [3]:
CSCS/UNIBE 69.71/53.58 (bad downtime for UNIBE)


  • CPU/WT efficiency [4]:
CSCS/UNIBE 0.53/0.72 (CSCS recovering following the downtime and GPFS fix)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=ewa

[2] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=wab

[3] http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&sortBy=0&granularity=8%20Hours&generic=0&series=All&type=ebwc

[4] http://dashb-atlas-job.cern.ch/dashboard/request.py/efficiency_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=eal

UNIBE-ID

  • Xxx

UNIGE

  • Operations
    • Old User Interfaces (UIs) with SLC5 moved to batch as Worker Nodes (16 cores x 3 old UIs = 48 cores)
    • Currently, UniGe-DPNC has around 800 cores in the batch for local users and ATLAS Grid production
    • Some accounting discrepancies found when checking the ATLAS dashboard
    • In general: running smoothly, with cluster usage by local DPNC users and ATLAS Grid production increasing over time
  • Storage
    • Running short of space because other local DPNC groups also use the Grid storage; need to clean up some old data
    • ATLAS DDM blacklisted the TRIG-DAQ space token, although there is free space
      • Probably due to the reduction of space for the ATLASGROUPDISK space token, since I moved some space; I should check it out
      • Currently decreased from 25 TB to 20 TB
  • Accounting: see the attached log g07.201610.log (UniGe-DPNC accounting, October 2016)

NGI_CH

  • Funding for the NGI_CH liaison roles (operations manager, security officer, etc.) runs out at the end of the year.
  • Possible scenario: 15k/year provided by the CHIPP CB institutes, with Bern (via LHEP or the Scientific IT Support unit) providing the service, as now.
  • Any alternative proposals: please reply to the e-mail thread.

  • NGI-CH Open Tickets review:
https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&supportunit=NGI_CH&su_hierarchy=0&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=06+May+2014&to_date=07+May+2014&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21

AFS related: 124818 (PSI) in progress; 124815 (UZH): contacted UZH to check whether the site is obsolete -> could deactivate it in GOCDB

ATLAS CSCS: 124719 (squid down) needs a restart on atlas01

DINO: squid started.

ATLAS UNIBE: 124518 (higher than normal failure rate at Ubelix). Main cause of failure fixed, now dealing with some job timeouts

ATLAS UNIBE: 117899 (storage dumps) on hold

CMS CSCS: 124714 (jobs not running) fixed?

Accounting: CSCS: 123765 (CREAM accounting): needs action from CSCS; UNIBE: 124320 (not publishing): actions carried out, need to check the status again

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS: Fabio
  • ATLAS: Gianfranco (apologies), Luis
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments
  • g07.201610.log (log, 1.1 K, 2016-11-11 12:47, LuisMarch): UniGe-DPNC accounting - October 2016