Swiss Grid Operations Meeting on 2016-02-04 at 14:00

Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
External link: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
Phone gate: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

- Site status
  - CSCS
  - PSI
  - UNIBE-LHEP
  - UNIBE-ID
  - UNIGE
  - NGI_CH
- Other topics
- A.O.B.
- Attendants
- Action items

Site status

CSCS

STORAGE

Hardware / Physical install
- 8 Feb: new dCache servers (4x)
- 8 Feb: MPO in order to connect Phoenix to the CSCS SAN
- 9 Feb: NETAPP E5660 (~0.5PB)

dCache
- The 'cleaner problem' (mainly affecting CMS) is no longer present. Space is freed automatically, as expected
- ATLAS dumps are in place; some adjustments are still needed for 'atlasgroupdisk/perf-egamma' and 'atlasscratchdisk' ( https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=428 )

GPFS
- Unplanned maintenance was needed on Wed 3rd Feb in order to recreate the filesystem because of a metadata inconsistency problem.

Systems

- Preparing and consolidating racks for new arrivals end of this month

- Checking published values of HEPspec

- Tuned Slurm config to improve cluster performance

- Fixed two HP nodes, one of them with IB failures and the other with a faulty 1G man network card

- Testing a complete Puppet installation for worker nodes; it is working fine, I just have to check some cvmfs parameters and the cream wrapper script.

Accounting numbers (from scheduler) from last month
- http://ganglia.lcg.cscs.ch/ganglia/SLURM_REPORTS/phoenix_slurm_report_201601.txt

PSI

Xxx
Accounting numbers (from scheduler) from last month

UNIBE-LHEP

Operations

Nothing significant to report; stable operation on both systems
256 new cores delivered yesterday; we hope to deploy them before the weekend

ATLAS specific operations

No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
ANALY_UNIBE-LHEP blacklisted in HC: no time to debug but low impact since right now ANALY jobs aren't too many
A couple of stable weeks of operation for UNIBE-LHEP_CLOUD_MCORE, then we lost the cluster and have not been able to fix it yet

Accounting

Accounting numbers (from scheduler) from last month (Jan 2016)
- CPU h: 792492 (ATLAS) - 12671 (t2k.org) - 1879 (uboone) - 25 (ops)
Accounting numbers (from ATLAS dashboard) from last month (Jan 2016)
- CPU h: 662466 (774848 with cloud)
- WC h: 679368 (796292 with cloud)
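
The figures above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the site's accounting tooling: it assumes that "WC h" means wall-clock hours, that CPU efficiency is CPU hours divided by wall-clock hours, and that the "with cloud" figures include the non-cloud ones. All variable and function names are hypothetical.

```python
# Figures copied from the Jan 2016 accounting lines above.
scheduler_cpu_h = {"ATLAS": 792492, "t2k.org": 12671, "uboone": 1879, "ops": 25}
dashboard = {
    "without cloud": {"cpu_h": 662466, "wc_h": 679368},
    "with cloud": {"cpu_h": 774848, "wc_h": 796292},
}


def cpu_efficiency(cpu_h, wc_h):
    """CPU hours over wall-clock hours (assumed meaning of 'WC h')."""
    return cpu_h / wc_h


# Total CPU hours seen by the scheduler, and the ATLAS fraction of it.
total_scheduler_cpu_h = sum(scheduler_cpu_h.values())
atlas_share = scheduler_cpu_h["ATLAS"] / total_scheduler_cpu_h

# Efficiency per dashboard row.
efficiency = {name: cpu_efficiency(**row) for name, row in dashboard.items()}

# Cloud-only contribution, assuming the "with cloud" numbers are inclusive.
cloud_cpu_h = dashboard["with cloud"]["cpu_h"] - dashboard["without cloud"]["cpu_h"]
cloud_wc_h = dashboard["with cloud"]["wc_h"] - dashboard["without cloud"]["wc_h"]

print(f"Scheduler total CPU h: {total_scheduler_cpu_h}")
print(f"ATLAS share of scheduler CPU h: {atlas_share:.1%}")
for name, value in efficiency.items():
    print(f"ATLAS dashboard CPU efficiency ({name}): {value:.1%}")
print(f"Cloud-only CPU h / WC h: {cloud_cpu_h} / {cloud_wc_h}")
```

Note that the scheduler and the ATLAS dashboard count different things, so the two sets of numbers are not expected to match exactly.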

UNIBE-ID

Xxx
Accounting numbers (from scheduler) from last month

UNIGE

Operations

Running smoothly: Higher user activity since last meeting
Grid (ATLAS) jobs: UNIGE-DPNC is in "Test" status and ~ 1/3 of jobs failed, apparently because they "ran out of memory". This needs checking
We plan a scheduled downtime at some point, needed for system and security upgrades (also related to getting involved in ATLAS production)

Storage

Dump of DPM SE for ATLAS experiment finally submitted (this dump should be provided once a month)
In addition to these ATLAS checks, we should clean our DPM: Old user data and other projects (To Be Done)

Outlook

Request for a network switch upgrade to 10 Gb/s and the acquisition of 3 GPUs already submitted (awaiting a decision in ~ March 2016)
Install Puppet for the DPM SE (and probably also for cluster configuration and setup, replacing yaim)

Accounting

Accounting numbers (from scheduler) from last month

NGI_CH

Nothing to report
NGI-CH Open Tickets review

https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO

- CSCS-LCG2
  - 117786 (ATLAS: storage dumps) almost done - should fix two paths
  - 119021 (LHCb team: jobs failed) no information provided - changed to "waiting for reply"
  - 119171 (CMS: Workflow failures) in progress
- UNIBE-LHEP
  - 117899 (ATLAS: storage dumps) on hold
- NGI_CH
  - 118922 (affects CSCS-LCG2 and UNIBE-LHEP): GlueSubClusterPhysicalCPUs, GlueSubClusterLogicalCPUs in the bdii - added explicit notification to CSCS-LCG2

A.O.B.

Attendants

CSCS:
CMS:
ATLAS: Luis March
LHCb:
EGI: Luis March

Action items

Item1

Topic revision: r10 - 2016-02-04 - GianfrancoSciacca