Swiss Grid Operations Meeting on 2016-01-14
Site status
CSCS
- Compute
- Confirmed order of 40 WNs
- Details:
- DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis) configuration with:
- Each node supports:
- 2 x 14 Core Intel Xeon E5-2680 v4 2.4GHz Processor (~1Q2016)
- 128GB 2133MHz DDR4 ECC Reg. Server Memory
- 1 x 120GB Intel SSD Drive
- 2 x Intel 10/100/1000 Gigabit-Ethernet onboard
- 1 x ETH0/NIC1 LAN Interface for IPMI & KVM over IP
- 1 x Mellanox ConnectX-3 FDR Dual-Port Adapter
- 2 x high-efficiency 1600 W hot-plug redundant power supplies
- As discussed with Gianfranco, we disabled the reboot request issued every ~60 days for the WNs, to avoid idle nodes
- Updated all WNs to SL 6.7 and updated their packages
- Started working on the new Python version of sltop, based on the pyslurm module
- Fixing some small issues with user creation via Puppet on the WNs
- We have some HP nodes down
- Accounting numbers (from scheduler) from last month (slurm report)
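A minimal sketch of how the monthly numbers can be pulled out of an `sreport` run (e.g. `sreport -P -n cluster AccountUtilizationByUser`). The pipe-separated field order and the sample lines below are assumptions for illustration, so the sketch runs without a SLURM installation:

```python
def cpu_hours_by_account(text):
    """Sum the Used column (CPU minutes) per account and convert to hours."""
    totals = {}
    for line in text.strip().splitlines():
        _cluster, account, _login, _name, used = line.split("|")
        totals[account] = totals.get(account, 0.0) + int(used) / 60.0
    return totals

# Invented sample in the assumed pipe-separated layout:
# Cluster|Account|Login|Proper Name|Used(minutes)
SREPORT_SAMPLE = """\
phoenix|atlas|atlprd01|ATLAS prod|600
phoenix|atlas|atlusr02|ATLAS user|120
phoenix|cms|cmsprd01|CMS prod|300"""

print(cpu_hours_by_account(SREPORT_SAMPLE))  # {'atlas': 12.0, 'cms': 5.0}
```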
- Storage
- New hardware:
- 4 x NetApp E5660 (500 TB total) arrived; another 500 TB from the CSCS SAN will be ready soon
- Delivery of 4 x Lenovo M5 servers scheduled for next week
- GPFS
- dCache
- Upgrade to v2.10 has been completed.
- Still have to fix some issues
- Dumps for ATLAS will work starting this month (host mapping from atlas01)
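On the sltop rewrite mentioned under Compute: a minimal sketch of a top-like job view built from pyslurm-shaped data. On a real cluster the dict would come from a pyslurm job query; the field names and sample jobs below are assumptions for illustration, so the sketch runs anywhere:

```python
def sltop_view(jobs):
    """Render a top-like table (list of lines) of running jobs, busiest first."""
    header = f"{'JOBID':>8} {'USER':<10} {'PART':<6} {'CPUS':>5} {'STATE':<8}"
    lines = [header]
    # Sort by CPU count, descending, like top sorts by CPU usage.
    for jid, j in sorted(jobs.items(), key=lambda kv: -kv[1]["num_cpus"]):
        lines.append(f"{jid:>8} {j['user']:<10} {j['partition']:<6} "
                     f"{j['num_cpus']:>5} {j['job_state']:<8}")
    return lines

# Hard-coded sample mimicking the shape of a pyslurm job query result;
# field names are assumptions for illustration only.
JOBS = {
    1001: {"user": "atlasprd", "partition": "wn", "num_cpus": 28, "job_state": "RUNNING"},
    1002: {"user": "cmsprd",   "partition": "wn", "num_cpus": 8,  "job_state": "RUNNING"},
}

for line in sltop_view(JOBS):
    print(line)
```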
PSI
- Boot issues with HP Smart Array P440 Controller in HBA mode
- Had boot issues with the HP Smart Array P440 controller on CentOS 7 with the controller configured in HBA mode; we had to manually upgrade the controller driver to hpsa-3.4.12. Read about these driver-update cases here; usually you won't want to do that, but in this case it was necessary.
- Nagios and HP HW
- ZFS On CentOS7
- So far satisfied; be aware of the new Jan '16 release 0.6.5.4; here are the closed issues
- Still not in production though; there is a lot that is new in CentOS 7 that I want to learn thoroughly, and the Christmas break didn't help
- Finally new HW at PSI in 2016
- With the current $$ in Q1 '16 we're going to buy:
- With new $$ after Q1 '16 we're going to buy:
UNIBE-LHEP
Operations
- ce02 re-installed with ROCKS 6.2, SLC6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
- ce01 operating with SLURM 15.08.01 for two and a half months, very stable (256 new cores awaiting delivery)
- 1980 cores once installation is finalised
- Finalised the integration of the Microboone VO (OSG - Fermilab) on both clusters
- New CE (ce04) fronting a SLURM cluster running on the SWITCHEngines cloud infrastructure
- Elasticluster on OpenStack
- Integrated in Panda ( UNIBE-LHEP_CLOUD )
- Currently commissioning with HammerCloud
ATLAS specific operations
- Very stable running over the end-of-year break
- Monthly dumps of the namespace on the DPM SE still depend on re-deployment of the DPM head node on SLC6 (configuration moving from YAIM to Puppet)
- Accounting numbers (from scheduler) from last month
- Incomplete+provisional (ce02 missing)
- CPU h: 410292 (ATLAS), 239 (t2k.org), 572 (uboone), 18 (ops)
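For a quick cross-check, the provisional per-VO CPU hours reported above can be totalled (numbers copied from the bullet; the VO keys are just labels, with the uboone spelling normalized):

```python
# Per-VO CPU hours as reported above (provisional, ce02 missing).
CPU_H = {"ATLAS": 410292, "t2k.org": 239, "uboone": 572, "ops": 18}

total = sum(CPU_H.values())
print(total)                            # 411121
print(f"{CPU_H['ATLAS'] / total:.1%}")  # 99.8% of the delivered hours
```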
UNIBE-ID
UNIGE
Operations
- Running smoothly during the last month and the Christmas break (lower user activity on the cluster)
- Outlook: we will install Puppet for DPM (and probably also for cluster configuration and setup)
- A testbed with a file server (atlasfs29) and one PC for services (Puppet)
- Outlook: we are going to request around 3 GPUs (image processing) following a user request
Network - Outlook
- We are going to request a new 10 Gb/s network switch (network switch upgrade) for the cluster
- Mainly focused on the data management systems (DPM and NFS, primarily NFS)
- The request will be made at the end of the month; if approved, we could have it in the summer
Storage
- Checking the data stored at the DPM SE (via a namespace dump) for cleaning purposes, as requested by ATLAS
- Dump not yet sent to ATLAS (scheduled to be done)
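The kind of comparison behind this dump-based cleanup can be sketched as a set difference between the files listed in the SE namespace dump and the files the experiment's catalogue believes it owns; the paths and file lists below are invented for illustration:

```python
def dark_and_lost(se_dump, catalogue):
    """Files on the SE but not in the catalogue ('dark'), and vice versa ('lost')."""
    se, cat = set(se_dump), set(catalogue)
    return sorted(se - cat), sorted(cat - se)

# Invented paths, for illustration only.
SE_DUMP = ["/dpm/unige.ch/home/atlas/f1", "/dpm/unige.ch/home/atlas/f2"]
CATALOGUE = ["/dpm/unige.ch/home/atlas/f2", "/dpm/unige.ch/home/atlas/f3"]

dark, lost = dark_and_lost(SE_DUMP, CATALOGUE)
print("dark:", dark)   # candidates for cleanup on the SE
print("lost:", lost)   # registered centrally but missing from the dump
```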
Accounting
NGI_CH
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS:
- CMS: Fabio, Joosep
- ATLAS: Gianfranco
- LHCb: Roland
- EGI: Gianfranco
Action items
Topic revision: r14 - 2016-01-22 - FabioMartinelli