Swiss Grid Operations Meeting on 2016-02-04 at 14:00
Site status
CSCS
- STORAGE
Hardware / Physical install
- 8 Feb: new dCache servers (4x)
- 8 Feb: MPO in order to connect Phoenix to the CSCS SAN
- 9 Feb: NETAPP E5660 (~0.5PB)
dCache
- The ‘cleaner problem’ (mainly affecting CMS) is no longer present. Space is freed automatically, as expected
- ATLAS dumps in place; the paths for 'atlasgroupdisk/perf-egamma' and 'atlasscratchdisk' still need adjusting ( https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=428 )
GPFS
- Unplanned maintenance was needed on Wed 3rd Feb in order to recreate the filesystem because of a metadata inconsistency problem.
- Systems
- Preparing and consolidating racks for new arrivals end of this month
- Checking published values of HEPspec
- Tuned Slurm config to improve cluster performance
- Fixed two HP nodes, one with IB failures and the other with a faulty 1G management network card
- Testing a complete Puppet installation for worker nodes; it is working fine, only some CVMFS parameters and the CREAM wrapper script remain to be checked.
- Accounting numbers (from scheduler) from last month
PSI
- Xxx
- Accounting numbers (from scheduler) from last month
UNIBE-LHEP
Operations
- Nothing significant to report; stable operation on both systems
- 256 new cores delivered yesterday, hope to deploy before weekend
ATLAS specific operations
- No progress on the storage dumps requested by ATLAS (due to no progress in the re-deployment of the DPM head node on SLC6)
- ANALY_UNIBE-LHEP blacklisted in HC: no time to debug, but low impact since there are few ANALY jobs right now
- A couple of stable weeks of operation for UNIBE-LHEP_CLOUD_MCORE, then we lost the cluster and have not been able to fix it yet
Accounting
- Accounting numbers (from scheduler) from last month (Jan 2016)
- CPU h: 792492 (ATLAS) - 12671 (t2k.org) - 1879 (uboone) - 25 (ops)
- Accounting numbers (from ATLAS dashboard) from last month (Jan 2016)
- CPU h: 662466 (774848 with cloud)
- WC h: 679368 (796292 with cloud)
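The UNIBE-LHEP figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the numbers are copied from the bullets, while the variable names and the CPU/wall-clock ratio are illustrative additions and not quantities reported in the minutes.

```python
# Figures copied from the UNIBE-LHEP accounting bullets (Jan 2016).
# Variable names are hypothetical, chosen only for this sketch.
scheduler_cpu_h = {"ATLAS": 792492, "t2k.org": 12671, "uboone": 1879, "ops": 25}

dashboard_cpu_h = 662466             # ATLAS dashboard, CPU hours
dashboard_cpu_h_with_cloud = 774848
dashboard_wc_h = 679368              # ATLAS dashboard, wall-clock hours
dashboard_wc_h_with_cloud = 796292

# Total CPU hours seen by the scheduler across all VOs.
total_scheduler_cpu_h = sum(scheduler_cpu_h.values())

# Contribution of the cloud resource: difference between the two dashboard figures.
cloud_cpu_h = dashboard_cpu_h_with_cloud - dashboard_cpu_h
cloud_wc_h = dashboard_wc_h_with_cloud - dashboard_wc_h

# CPU / wall-clock ratio (a derived quantity, assumed useful as a rough efficiency check).
cpu_wc_ratio = dashboard_cpu_h / dashboard_wc_h
cpu_wc_ratio_with_cloud = dashboard_cpu_h_with_cloud / dashboard_wc_h_with_cloud

print(f"Scheduler total CPU h: {total_scheduler_cpu_h}")
print(f"Cloud share: {cloud_cpu_h} CPU h, {cloud_wc_h} WC h")
print(f"CPU/WC ratio: {cpu_wc_ratio:.3f} ({cpu_wc_ratio_with_cloud:.3f} with cloud)")
```

The scheduler and dashboard totals are not expected to match exactly, since they come from different sources.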
UNIBE-ID
- Xxx
- Accounting numbers (from scheduler) from last month
UNIGE
Operations
- Running smoothly; higher user activity since last meeting
- Grid (ATLAS) jobs: UNIGE-DPNC in "Test" status and ~ 1/3 of jobs failed, apparently because they "ran out of memory". Needs checking
- We plan a scheduled downtime at some point, needed for system and security upgrades (also related to getting involved in ATLAS production)
Storage
- Dump of DPM SE for ATLAS experiment finally submitted (this dump should be provided once a month)
- In addition to these ATLAS checks, we should clean our DPM: Old user data and other projects (To Be Done)
Outlook
- Request for network switch upgrade to 10 Gb/s + acquisition of 3 GPUs already submitted (awaiting decision in ~ March 2016)
- Install puppet for DPM SE (and probably also for cluster configuration and setup, replacing yaim)
Accounting
- Accounting numbers (from scheduler) from last month
NGI_CH
- Nothing to report
- NGI-CH Open Tickets review
https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
- CSCS-LCG2
- 117786 (ATLAS: storage dumps) almost done - should fix two paths
- 119021 (LHCb team: jobs failed) no information provided - changed to "waiting for reply"
- 119171 (CMS: Workflow failures) in progress
- UNIBE-LHEP
- 117899 (ATLAS: storage dumps) on hold
- NGI_CH
Other topics
Next meeting date:
A.O.B.
Attendees
- CSCS:
- CMS:
- ATLAS: Luis March
- LHCb:
- EGI: Luis March
Action items