Swiss Grid Operations Meeting on 2016-01-14
Site status
CSCS
- Compute
- Confirmed order of 40 WNs
- Details:
- DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis) configuration with:
- Each node supports:
- 2 x 14 Core Intel Xeon E5-2680 v4 2.4GHz Processor (~1Q2016)
- 128GB 2133MHz DDR4 ECC Reg. Server Memory
- 1 x 120GB Intel SSD Drive
- 2 x Intel 10/100/1000 Gigabit-Ethernet onboard
- 1 x ETH0/NIC1 LAN Interface for IPMI & KVM over IP
- 1 x Mellanox ConnectX-3 FDR Dual-Port Adapter
- 2 x high-efficiency 1600 W hot-plug redundant power supplies
- As discussed with Gianfranco, we disabled the reboot request issued every ~60 days for the WNs, to avoid idle nodes
- Updated all WNs to SL 6.7 and updated their packages
- Started working on the new Python version of sltop, based on the pyslurm module
- Fixing some small issues with user creation via Puppet on the WNs
- We have some HP nodes down
- Accounting numbers (from scheduler) from last month (slurm report)
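A minimal sketch of how the monthly numbers can be pulled out of an `sreport` run (e.g. `sreport -P -n cluster AccountUtilizationByUser`). The pipe-separated field order and the sample lines below are assumptions for illustration, so the sketch runs without a SLURM installation:

```python
def cpu_hours_by_account(text):
    """Sum the Used column (CPU minutes) per account and convert to hours."""
    totals = {}
    for line in text.strip().splitlines():
        _cluster, account, _login, _name, used = line.split("|")
        totals[account] = totals.get(account, 0.0) + int(used) / 60.0
    return totals

# Invented sample in the assumed pipe-separated layout:
# Cluster|Account|Login|Proper Name|Used(minutes)
SREPORT_SAMPLE = """\
phoenix|atlas|atlprd01|ATLAS prod|600
phoenix|atlas|atlusr02|ATLAS user|120
phoenix|cms|cmsprd01|CMS prod|300"""

print(cpu_hours_by_account(SREPORT_SAMPLE))  # {'atlas': 12.0, 'cms': 5.0}
```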
- Storage
- New hardware:
- 4 x NetApp E5660 (500 TB total) arrived; another 500 TB from the CSCS SAN will be ready soon
- Delivery of 4 x Lenovo M5 servers scheduled for next week
- GPFS
- dCache
- Upgrade to v2.10 has been completed.
- Still have to fix some issues
- Dumps for ATLAS will work starting this month (host mapping from atlas01)
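On the sltop rewrite mentioned under Compute: a minimal sketch of a top-like job view built from pyslurm-shaped data. On a real cluster the dict would come from a pyslurm job query; the field names and sample jobs below are assumptions for illustration, so the sketch runs anywhere:

```python
def sltop_view(jobs):
    """Render a top-like table (list of lines) of running jobs, busiest first."""
    header = f"{'JOBID':>8} {'USER':<10} {'PART':<6} {'CPUS':>5} {'STATE':<8}"
    lines = [header]
    # Sort by CPU count, descending, like top sorts by CPU usage.
    for jid, j in sorted(jobs.items(), key=lambda kv: -kv[1]["num_cpus"]):
        lines.append(f"{jid:>8} {j['user']:<10} {j['partition']:<6} "
                     f"{j['num_cpus']:>5} {j['job_state']:<8}")
    return lines

# Hard-coded sample mimicking the shape of a pyslurm job query result;
# field names are assumptions for illustration only.
JOBS = {
    1001: {"user": "atlasprd", "partition": "wn", "num_cpus": 28, "job_state": "RUNNING"},
    1002: {"user": "cmsprd",   "partition": "wn", "num_cpus": 8,  "job_state": "RUNNING"},
}

for line in sltop_view(JOBS):
    print(line)
```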
PSI
- Boot issues with HP Smart Array P440 Controller in HBA mode
- Had boot issues with the HP Smart Array P440 controller on CentOS 7 with the controller configured in HBA mode; we had to manually upgrade the controller driver to hpsa-3.4.12. Read about these driver-update cases here; usually you won't want to do that, but in this case it was necessary.
- Nagios and HP HW
- ZFS On CentOS7
- So far satisfied; be aware of the new Jan '16 release 0.6.5.4; here are the closed issues
- Still not in production though; there is a lot that is new in CentOS 7 that I want to learn thoroughly, and the Christmas break didn't help
- Finally new HW at PSI in 2016
- With the current $$ in Q1 '16 we're going to buy:
- With new $$ after Q1 '16 we're going to buy:
UNIBE-LHEP
Operations
- ce02 re-installed with ROCKS 6.2, SLC6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
- ce01 operating with SLURM 15.08.01 for two and a half months, very stable (256 new cores awaiting delivery)
- 1980 cores once installation is finalised
- Finalised the integration of the Microboone VO (OSG - Fermilab) on both clusters
- New CE (ce04) fronting a SLURM cluster running on the SWITCHEngines cloud infrastructure
- Elasticluster on OpenStack
- Integrated in Panda ( UNIBE-LHEP_CLOUD )
- Currently commissioning with HammerCloud
ATLAS specific operations
- Very stable running over the end-of-year break
- Monthly dumps of the namespace on the DPM SE still depend on re-deployment of the DPM head node on SLC6 (configuration moving from YAIM to Puppet)
- Accounting numbers (from scheduler) from last month
- Incomplete+provisional (ce02 missing)
- CPU h: 410292 (ATLAS), 239 (t2k.org), 572 (uboone), 18 (ops)
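For a quick cross-check, the provisional per-VO CPU hours reported above can be totalled (numbers copied from the bullet; the VO keys are just labels, with the uboone spelling normalized):

```python
# Per-VO CPU hours as reported above (provisional, ce02 missing).
CPU_H = {"ATLAS": 410292, "t2k.org": 239, "uboone": 572, "ops": 18}

total = sum(CPU_H.values())
print(total)                            # 411121
print(f"{CPU_H['ATLAS'] / total:.1%}")  # 99.8% of the delivered hours
```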
UNIBE-ID
UNIGE
Operations
- Running smoothly during the last month and the Christmas break (lower user activity on the cluster)
- Outlook: we will install Puppet for DPM (and probably also for cluster configuration and setup)
- A testbed with a file server (atlasfs29) and one PC for services (Puppet)
- Outlook: we are going to request around 3 GPUs (image processing) following a user request
Network - Outlook
- We are going to request a new 10 Gb/s network switch (network switch upgrade) for the cluster
- Mainly focused on the data management systems (DPM and NFS, primarily NFS)
- The request will be made at the end of the month; if approved, we could have it in the summer
Storage
- Checking the data stored at the DPM SE (via a namespace dump) for cleaning purposes, as requested by ATLAS
- Dump not yet sent to ATLAS (scheduled to be done)
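The kind of comparison behind this dump-based cleanup can be sketched as a set difference between the files listed in the SE namespace dump and the files the experiment's catalogue believes it owns; the paths and file lists below are invented for illustration:

```python
def dark_and_lost(se_dump, catalogue):
    """Files on the SE but not in the catalogue ('dark'), and vice versa ('lost')."""
    se, cat = set(se_dump), set(catalogue)
    return sorted(se - cat), sorted(cat - se)

# Invented paths, for illustration only.
SE_DUMP = ["/dpm/unige.ch/home/atlas/f1", "/dpm/unige.ch/home/atlas/f2"]
CATALOGUE = ["/dpm/unige.ch/home/atlas/f2", "/dpm/unige.ch/home/atlas/f3"]

dark, lost = dark_and_lost(SE_DUMP, CATALOGUE)
print("dark:", dark)   # candidates for cleanup on the SE
print("lost:", lost)   # registered centrally but missing from the dump
```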
Accounting
NGI_CH
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS:
- CMS: Fabio, Joosep
- ATLAS: Gianfranco
- LHCb: Roland
- EGI: Gianfranco
Action items
Topic revision: r14 - 2016-01-22 - FabioMartinelli