Swiss Grid Operations Meeting on 2016-08-04 at 14:00

Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
External link: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
Phone gate: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
Switch Vidyo SIP IP: 137.138.248.204

- Site status
  - CSCS
  - PSI
  - UNIBE-LHEP
  - UNIBE-ID
  - UNIGE
  - NGI_CH
- Other topics
- A.O.B.
- Attendants
- Action items

Site status

CSCS

- Accounting numbers (from scheduler) from last month
- Worked mainly on the GPFS slowness issue and the lcg-cp problem

- GPFS slowness is caused by I/O-intensive jobs running simultaneously
- The deprecated lcg-cp command was replaced by gfal-copy; CMS and ATLAS changed it in their site configuration
  - Is LHCb facing the same issue?
- perfsonar01/02 were down because of disk failures; both machines were reinstalled with Puppet
- cream[01-03] were removed from the BDII and GOCDB yesterday, so they are officially decommissioned; cream01 and cream03 were powered off today
- Reinstalling the BDII with Puppet

Accounting numbers July:

VO	Cpu Hours
cms	1'793'900.165
atlas	1'118'498.575
lhcb	811'097.677
ops	19.319
TOTAL	3'723'519.013
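
As a quick cross-check of the table above, the following minimal sketch sums the listed per-VO CPU hours and compares them with the stated TOTAL. The values are copied from the table; the function names are illustrative only. The four listed rows add up to 3'723'515.736, which is 3.277 CPU hours below the stated TOTAL; the minutes do not say where the remainder comes from.

```python
from decimal import Decimal

# CPU hours per VO, copied from the July table above.
# Apostrophes are thousands separators.
ROWS = {
    "cms": "1'793'900.165",
    "atlas": "1'118'498.575",
    "lhcb": "811'097.677",
    "ops": "19.319",
}
REPORTED_TOTAL = "3'723'519.013"


def parse_hours(text):
    """Convert a number such as 1'793'900.165 to an exact Decimal."""
    return Decimal(text.replace("'", ""))


def summarize(rows, reported_total):
    """Return (sum of listed rows, reported total minus that sum, share per VO in %)."""
    hours = {vo: parse_hours(value) for vo, value in rows.items()}
    listed_sum = sum(hours.values(), Decimal(0))
    difference = parse_hours(reported_total) - listed_sum
    shares = {vo: h / listed_sum * 100 for vo, h in hours.items()}
    return listed_sum, difference, shares


if __name__ == "__main__":
    listed_sum, difference, shares = summarize(ROWS, REPORTED_TOTAL)
    print(f"Sum of listed VOs: {listed_sum}")
    print(f"Reported TOTAL minus sum: {difference}")
    for vo, share in shares.items():
        print(f"{vo}: {share:.1f}% of listed CPU hours")
```

Decimal is used instead of float so that the comparison with the reported total is exact to the third decimal place.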

PSI

Accounting numbers (from scheduler) from last month
New HW
- 3 Dalco UI
  - each featuring [ 128GB RAM, 2 * E5-2697v4 CPUs, 6*1.8TB 10k disks, 2*10GbE ]
- 1 Storage, type NetApp E2760
  - [ 52*6TB disks + 8*400GB SSD ], 2 SAS-based RAID controllers
  - final net capacity ~200TB
  - NetApp SANtricity SSD Cache
  - NetApp SANtricity Dynamic Disk Pools

GGUS Tickets vs CSCS

Following Failures at T2_CH_CSCS
CMS jobs now call gfal-copy because of my recent change to command value="gfal2" in site-local-config.xml (marked with <---- in the listing below)

$ find /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/PhEDEx
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/PhEDEx/storage.xml
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/cmsset_local.sh
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/cmsset_local.csh
/cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/site-local-config.xml <----
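
To confirm which copy command a site configuration currently selects, something like the following sketch could be used. It is an illustration only: the embedded XML is a hypothetical, heavily trimmed stand-in for site-local-config.xml (only the `command value="gfal2"` setting comes from these minutes; the surrounding element names are assumptions), and `stage_out_commands` is an illustrative helper, not part of any CMS tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed stand-in for site-local-config.xml.
# Only the command value="gfal2" setting is taken from the minutes;
# the enclosing element names are assumptions for illustration.
SAMPLE = """<site-local-config>
  <site name="T2_CH_CSCS">
    <local-stage-out>
      <command value="gfal2"/>
    </local-stage-out>
  </site>
</site-local-config>"""


def stage_out_commands(xml_text):
    """Return the value attribute of every <command> element in the document."""
    root = ET.fromstring(xml_text)
    return [element.get("value") for element in root.iter("command")]


if __name__ == "__main__":
    print(stage_out_commands(SAMPLE))
```

On a machine with CVMFS mounted, the text of /cvmfs/cms.cern.ch/SITECONF/T2_CH_CSCS/JobConfig/site-local-config.xml could be passed to the same function in place of the sample.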

Holidays
- I was on leave last week and will be on leave again next week

UNIBE-LHEP

Operations
- Nothing specific to report
ATLAS specific operations
- Nothing specific to report
HammerCloud report [1]
- UNIBE-LHEP online 74% (was 79% last month).
- UNIBE-ID 97% (this doesn't run the high I/O workloads, but it runs analysis)
- UNIBE-LHEP_CLOUD* 95%

[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE

ATLAS resource delivery UNIBE-LHEP vs CSCS-LCG2 [2]
- All jobs: 47% of ATLAS/CH (WallTime), 78% of ATLAS/CH (CPUtime)
- Good jobs: 68% of ATLAS/CH (WallTime), 84% of ATLAS/CH (CPUtime)

[2] http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary#button=cpuconsumption&sites%5B%5D=CSCS-LCG2&sites%5B%5D=UNIBE-LHEP&sitesCat%5B%5D=All+Countries&resourcetype=All&sitesSort=2&sitesCatSort=0&start=2016-06-01&end=2016-06-30&timerange=daily&granularity=Monthly&generic=0&sortby=0&series=All

Accounting numbers (from scheduler) for last month (Jul 2016) (includes ce03/CLOUD)

WC h: 780748 (ATLAS) - 35044 (t2k.org) - 3289 (uboone) - 12 (ops)

UNIBE-ID

Change of Resource Manager:
- ATLAS (ARC-CE) is now served by the new Slurm server
- The transition was easy enough; minor quirks in the first couple of hours were due to a forgotten change to the singlenode environment
- Stable operation since then
- The rest of the cluster will be moved to Slurm in the next maintenance downtime (2nd Thursday of December) => more cores again for ATLAS once OG-SGE is dropped
Operations
- Very stable operations lately

UNIGE

Operations
- Back in ATLAS production mode since July 25th:
  - Memory settings hacked in the PBS batch scheduler to run ATLAS production jobs
  - Debugging multi-core jobs: not running successfully yet
- Running smoothly: lower user activity due to the holiday period
Network
- Upgrade of the network switch (10 Gb/s) for the file systems coming soon
Holidays
- Next 2 weeks
Accounting numbers (from scheduler) from last month

NGI_CH

EGI central monitoring instance (ARGO)

Since July 1st, the EGI infrastructure has been monitored by two monitoring instances, available at these addresses:

https://argo-mon.egi.eu/nagios
https://argo-mon2.egi.eu/nagios

Both instances run the same set of tests, and the results they provide are equivalent.

Starting from the same date, the central ARGO Web UI ( http://argo.egi.eu/lavoisier ) provides information from these two instances and the Operations Portal was reconfigured to raise alarms based on information from ARGO central instances.

NGI-CH Open Tickets review
- CSCS
  - 122679 (CMS) timeout in file copy to SE (switch to gfal-copy broke some Nagios tests?)
  - 122486 (ATLAS) expose the full PFN through their xrootd doors => just closed it
  - 122155 (ATLAS) file transfers failing (inconsistent file size & checksum): 14 new files to check (updated today)