
Swiss Grid Operations Meeting on 2016-11-11 at 14:00

Site status

CSCS

Quick report on HEPiX Fall 2016 (first time at HEPiX)
  • Around 100 participants
  • Running HEP Workloads on the NERSC HPC (Tony Quan presentation)
    • Specs:
      • Cori Phase 1 (1630 Haswell nodes)
      • Cori Phase 2 (9399 Knights Landing nodes)
    • Different CVMFS approach, not applicable for us (we run it natively on the nodes; see the probe sketch after this list)
  • A lot of site reports (2 full days)
    • GPFS, Hadoop, Dropbox (CERNBox at CERN) used by many sites as storage solutions
    • Lustre widely used
    • Starting HPC and HTC integration activities
    • OpenStack and Docker widely used
    • Many monitoring solutions (infrastructure, HW, services, etc.)
    • Preparations for migration to IPv6
    • WAN connectivity upgrade in many sites
  • Storage
    • CephFS presentations by Australia (geo distributed)
    • HA dCache presentation
  • Computing & Batch
    • HTCondor (Slurm support, improved OpenStack/AWS support, containers)
  • Facilities
    • CERN OpenCompute project, not yet performing well (still too early to judge)
    • New Data Centers at CERN (Green Cube 2020)
  • Basic IT
    • Puppet used at many sites, several considering migration to v4
    • ELK stacks deployed at many sites
  • Cloud
    • Container orchestration at RAL
    • NERSC HPC resources: Shifter (now open-source), Burst buffer (dynamic allocation of high-performance filesystems)
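
A side note on the CVMFS item above: a minimal sketch, assuming a standard native CVMFS client installation, of how one might verify that the repositories are mounted and healthy on a worker node. The repository list is only an example, not the exact set configured at CSCS.

    import subprocess

    # Example repositories (assumption); adjust to whatever is configured on the node.
    REPOS = ["atlas.cern.ch", "cms.cern.ch", "lhcb.cern.ch"]

    def probe(repo):
        # "cvmfs_config probe <repo>" mounts the repository if needed and reports OK/FAILED;
        # the exit code reflects the result.
        return subprocess.call(["cvmfs_config", "probe", repo]) == 0

    if __name__ == "__main__":
        for repo in REPOS:
            print(repo, "OK" if probe(repo) else "FAILED")
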
System

  • site closed for CVE-2016-5195 ("Dirty COW") on 24-26 October: we waited for the patched kernel to be released and in the meantime kept working on the new scratch FS
  • all machines patched as soon as the new kernel was available (see the verification sketch after this list)
  • job slots re-enabled gradually after the maintenance
  • new scratch FS mounted on arc[02,03] while the old one was put on drain; arc01 is still using the original scratch FS
  • working on the CMS VO box
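
A minimal sketch of the kind of check one could run to confirm that a node boots a kernel containing the CVE-2016-5195 fix. The threshold version is an EL6-style example, not necessarily the exact kernel deployed at CSCS; take the authoritative value from the distribution's security advisory.

    import re
    import subprocess

    # Example threshold (assumption); replace with the fixed kernel version
    # listed in your distribution's CVE-2016-5195 advisory.
    MIN_FIXED = "2.6.32-642.6.2.el6"

    def version_key(release):
        # Strip distro tag / architecture, then compare the numeric fields.
        core = re.sub(r"\.(el\d+|x86_64|i686|noarch).*$", "", release)
        return tuple(int(x) for x in re.findall(r"\d+", core))

    def kernel_is_patched(min_fixed=MIN_FIXED):
        running = subprocess.check_output(["uname", "-r"]).decode().strip()
        return version_key(running) >= version_key(min_fixed), running

    if __name__ == "__main__":
        ok, running = kernel_is_patched()
        print("%s -> %s" % (running, "patched" if ok else "VULNERABLE"))
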
Storage


dCache

  • Production: stable, updated to the latest 2.10 patch
  • PreProduction: updated to 2.13; working on some gfal-copy problems (see the test sketch after this list).
  • Production update scheduled for the first week of December 2016
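
For the gfal-copy problems mentioned above, a minimal smoke-test sketch. The destination URL is a placeholder, not the real preproduction endpoint; it assumes the gfal-copy client (gfal2-util) is installed and a valid VOMS proxy is in place.

    import subprocess
    import tempfile

    # Placeholder destination (assumption); point it at the preproduction dCache door
    # and a path writable by your proxy.
    DST = "srm://preprod-se.example.ch/pnfs/example.ch/data/ops/gfal-smoke-test"

    def smoke_test(dst=DST):
        # Create a small local source file to copy.
        with tempfile.NamedTemporaryFile(prefix="gfal-smoke-", suffix=".txt", delete=False) as f:
            f.write(b"gfal-copy smoke test\n")
            src = f.name
        # gfal-copy exits non-zero on failure; -f overwrites an existing destination.
        result = subprocess.run(["gfal-copy", "-f", "file://" + src, dst],
                                capture_output=True, text=True)
        print(result.stdout or result.stderr)
        return result.returncode == 0

    if __name__ == "__main__":
        print("copy OK" if smoke_test() else "copy FAILED")
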
GPFS

  • Performance issues on Krusty02; performance now restored
  • Arc01 jobs -> phoenix_scratch
  • Arc02-03 jobs -> new_phoenix_scratch (DDN SFA12K), tested up to 6 GB/s, limited by the number of servers (4)
  • During the dCache maintenance we will move to 8 servers and review the results (see the estimate below)
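
The estimate referred to above is simple scaling arithmetic; it assumes aggregate scratch bandwidth grows roughly linearly with the number of GPFS servers until the DDN SFA12K backend saturates, which is exactly what the post-maintenance tests should verify.

    # Measured: 6 GB/s aggregate with 4 servers (from the bullet above).
    measured_gb_s = 6.0
    servers_now = 4
    servers_planned = 8

    per_server = measured_gb_s / servers_now      # ~1.5 GB/s per server
    projected = per_server * servers_planned      # ~12 GB/s if scaling stays linear

    print("per server: %.1f GB/s" % per_server)
    print("projected with %d servers: %.1f GB/s (upper bound, before backend limits)"
          % (servers_planned, projected))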

PSI

UNIBE-LHEP

  • Routine operation up to shutdown for CVE-2016-5195.
  • The downtime was incorrectly declared (by me), so the site was not taken offline; this affected the measured efficiency (black-hole effect too).
  • Infrastructure intervention during and following the downtime, running at reduced capacity for several days.
  • Firewall issue for ce04 (cloud) following the downtime: unavailable for a couple of weeks
  • Preparing for campus-wide power cut on 29-30 Nov.

  • HammerCloud status:
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=562#time=custom&start_date=2016-10-01&end_date=2016-10-31&values=false&spline=false&debug=false&resample=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2,CSCS-LCG2_MCORE,UNIBE-LHEP,UNIBE-LHEP_MCORE,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE
  • Accounting numbers (from scheduler) from last month (core-hours October 2016):
ATLAS: 933809; T2K: 10227; OPS: 31


  • Accounting numbers from ATLAS dashboard from last month (core-hours October 2016) [1],[2]:
CSCS / UNIBE 57% / 43% - 1575861 / 1185039 (reduced capacity at UNIBE after downtime)
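
For reference, the 57% / 43% split quoted above follows directly from the dashboard core-hour totals:

    # CSCS-LCG2 and UNIBE-LHEP core-hours, October 2016 (ATLAS dashboard, [1]).
    cscs, unibe = 1575861, 1185039
    total = cscs + unibe
    print("CSCS  %.0f%% (%d core-hours)" % (100.0 * cscs / total, cscs))
    print("UNIBE %.0f%% (%d core-hours)" % (100.0 * unibe / total, unibe))
    # -> CSCS 57%, UNIBE 43%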

  • Efficiency WT ok/fail [3]:
CSCS/UNIBE 69.71/53.58 (bad downtime for UNIBE)


  • CPU/WT efficiency [4]:
CSCS/UNIBE 0.53/0.72 (CSCS recovering following the downtime and GPFS fix)

[1] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=ewa

[2] http://dashb-atlas-job.cern.ch/dashboard/request.py/consumptions_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=wab

[3] http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&sortBy=0&granularity=8%20Hours&generic=0&series=All&type=ebwc

[4] http://dashb-atlas-job.cern.ch/dashboard/request.py/efficiency_individual?sites=CSCS-LCG2&sites=UNIBE-LHEP&sitesCat=All%20Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-10-01&end=2016-10-31&timeRange=daily&granularity=8%20Hours&generic=0&sortBy=0&series=All&type=eal

UNIBE-ID

  • Xxx

UNIGE

  • Operations
    • Old User Interfaces (UIs) with SLC5 moved to batch as Worker Nodes (16 cores x 3 old UIs = 48 cores)
    • Currently, UniGe-DPNC has around 800 cores in the batch for local users and ATLAS Grid production
    • Some accounting discrepancies found when checking the ATLAS dashboard
    • In general: running smoothly, with cluster usage by local DPNC users and ATLAS Grid production increasing over time
  • Storage
    • Running short of space because other local DPNC groups also use the Grid storage; need to clean up some old data
    • ATLAS DDM blacklisted the TRIG-DAQ space token, although there is free space
      • Probably due to the reduction of space for the ATLASGROUPDISK space token, since I moved some space; I should check it out
      • Currently decreased from 25 TB to 20 TB
  • Accounting: see the attached log g07.201610.log (UniGe-DPNC accounting, October 2016)

NGI_CH

  • Funding for the NGI_CH liaison roles (operations manager, security officer, etc.) runs out at the end of the year.
  • Possible scenario: 15k/year provided by the CHIPP CB institutes, with Bern (via LHEP or the Scientific IT Support unit) providing the service, as now.
  • Any alternative proposals: please reply to the e-mail thread.

  • NGI-CH Open Tickets review:
https://ggus.eu/index.php?mode=ticket_search&show_columns_check%5B%5D=TICKET_TYPE&show_columns_check%5B%5D=AFFECTED_VO&show_columns_check%5B%5D=AFFECTED_SITE&show_columns_check%5B%5D=PRIORITY&show_columns_check%5B%5D=RESPONSIBLE_UNIT&show_columns_check%5B%5D=STATUS&show_columns_check%5B%5D=DATE_OF_CHANGE&show_columns_check%5B%5D=SHORT_DESCRIPTION&ticket_id=&supportunit=NGI_CH&su_hierarchy=0&vo=&user=&keyword=&involvedsupporter=&assignedto=&affectedsite=&specattrib=none&status=open&priority=&typeofproblem=&ticket_category=all&mouarea=&date_type=creation+date&tf_radio=1&timeframe=any&from_date=06+May+2014&to_date=07+May+2014&untouched_date=&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO%21

AFS related: 124818 (PSI) in progress; 124815 (UZH): contacted UZH to check whether the site is obsolete -> could deactivate it in GOCDB

ATLAS CSCS: 124719 (squid down) needs a restart on atlas01

DINO: squid started.

ATLAS UNIBE: 124518 (higher than normal failure rate at Ubelix). Main cause of failure fixed, now dealing with some job timeouts

ATLAS UNIBE: 117899 (storage dumps) on hold

CMS CSCS: 124714 (jobs not running) fixed?

Accounting: CSCS: 123765 (CREAM accounting): needs action from CSCS; UNIBE: 124320 (not publishing): actions carried out, need to check the status again

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS: Fabio
  • ATLAS: Gianfranco (apologies), Luis
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments
  • g07.201610.log (log, 1.1 K, 2016-11-11 12:47, LuisMarch): UniGe-DPNC accounting - October 2016