Swiss Grid Operations Meeting on 2016-05-12 at 14:00
Site status
CSCS
- Accounting numbers (from scheduler) from last month
- IB eth bridges replaced
- A few IB QDR switches replaced with FDR switches
- Compute nodes re-installed
- CSCS Central puppet
- CSCS LDAP for users
- CSCS NFS for home
- ARC CE fresh installed with the following queues:
- arc01 64nodes (sandy bridge nodes, 64GB ram, 32 cores)
- arc02 48 nodes (ivy bridge nodes, 128GB ram, 40 cores)
- arc03 40 nodes (haswell nodes, 128GB ram, 48 cores), soon to be updated to v4
- CREAM01/02/03: reviewing accounting before final shutdown
- All Virtual machines are running on CSCS central VMware
- CMS re-installation to be planned with Puppet base installation (Firewall, Users, Grid Certificates, ..)
- Current allocation over 90%
- Allocation problems on the old 64GB RAM nodes (arc01 queue)
STORAGE
- GPFS
- A few weeks ago we reached 310M used inodes on the scratch fileset
- High load on the servers -> slow cleaning policy -> job problems
- The filesystem stayed online
- Consequences:
- Per-user inode quota (50M)
- Inode usage alerts
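The quota and alert bullets above could be wired together with a small watcher. A minimal sketch in Python, assuming per-user inode counts are already collected elsewhere (e.g. parsed from GPFS quota reports; the data source, user names, and the 80% warning threshold are assumptions, only the 50M quota comes from the minutes):

```python
# Minimal inode-usage alert sketch (threshold and sample users are hypothetical).
# Assumes per-user inode counts are already available, e.g. from a GPFS
# quota report; collecting them is not part of this sketch.

INODE_QUOTA = 50_000_000   # per-user inode quota (50M, as above)
WARN_FRACTION = 0.8        # assumed: alert at 80% of the quota

def inode_alerts(usage: dict,
                 quota: int = INODE_QUOTA,
                 warn_fraction: float = WARN_FRACTION) -> list:
    """Return one alert line per user at or above the warning threshold."""
    alerts = []
    for user, inodes in sorted(usage.items()):
        if inodes >= quota * warn_fraction:
            pct = 100.0 * inodes / quota
            alerts.append(f"{user}: {inodes} inodes ({pct:.0f}% of quota)")
    return alerts

if __name__ == "__main__":
    sample = {"alice": 45_000_000, "bob": 1_200_000}  # hypothetical users
    for line in inode_alerts(sample):
        print(line)
```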
- dCache
- 9K+ active connections (record?)
- 12+GBit internet bandwidth measured on the network (2x 6+Gbit)
- Real limit was about 2x8Gbit
- New limit is about 80Gbit with the new gateways
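As a sanity check on the figures above, the measured 12 Gbit against the old 2x8 Gbit limit is roughly 75% utilisation, and the new gateways give about 5x the old ceiling. A throwaway check (all numbers copied from the bullets above):

```python
# Quick utilisation check for the dCache gateway figures above.
measured_gbit = 12       # measured aggregate internet bandwidth
old_limit_gbit = 2 * 8   # old real limit (2x 8 Gbit)
new_limit_gbit = 80      # approximate limit with the new gateways

old_util = measured_gbit / old_limit_gbit
headroom_factor = new_limit_gbit / old_limit_gbit

print(f"old utilisation: {old_util:.0%}")       # 75%
print(f"new headroom: {headroom_factor:.0f}x")  # 5x
```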
Some technical details and numbers of the new storage that will be available in the next days (1PB)
- NETAPP (0.5PB)
2xController / 4xFC16 Links / 10xLUNs / 12x6TB drives per LUN (RAID6)
Architecture
Performance (6x dd from 2 servers, each on a different LUN, 3 per controller)
- DDN SFA12K (0.5PB)
2x Controller / 4x Storage Processors / 4x FC16 Links (1x per Storage Processor) / 24x LUNs / 10x3TB drives per LUN (RAID6)
Architecture -> same CSCS integration
Performance numbers will follow ASAP
PSI
- T3 upgraded to 10Gbs
- The local Net team deployed a Cisco Extender with 32x 10Gbs CAT6 ports and 8x 10Gbs fibre uplinks; so far 4/8 uplinks are cabled
- The 32 ports can be organized as LACP groups; we made 5 groups with 2 ports each (10 ports cabled in total); both the dCache storage servers and the ZFS/NFSv4 NAS are connected to these LACP groups
- Another 9 ports are used for the latest 10Gbs WNs (~500 CPU cores)
- Generally speaking, it was a smooth 1Gbs to 10Gbs transition without unexpected troubles
- dCache upgraded from 2.13 to 2.15
- Configuration files as in 2.13 worked also in 2.15, nice
- But the Chimera tables changed; there is a new field inumber that's now used as a key here and there instead of ipnfsid (which is still in the tables, though)
- The Chimera Table change means that both chimera-dump and my materialized view v_pnfs don't work right now
- I won't update chimera-dump; I am updating v_pnfs to be 2.15 compatible. CSCS will face the same issues on its own dCache roadmap
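To illustrate the kind of change a query like v_pnfs needs, here is a toy sketch with an in-memory SQLite database. The names (t_inodes, inumber, ipnfsid, t_tags) echo the bullets above, but this is not the real Chimera schema (which lives in PostgreSQL), only an illustration of re-keying a join from ipnfsid to the new inumber column:

```python
import sqlite3

# Illustrative only: a toy stand-in for a Chimera-like table where the new
# integer key `inumber` is the join key, while `ipnfsid` is still present.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t_inodes (inumber INTEGER PRIMARY KEY, ipnfsid TEXT, isize INTEGER)")
con.execute("CREATE TABLE t_tags (inumber INTEGER, tag TEXT)")
con.executemany("INSERT INTO t_inodes VALUES (?, ?, ?)",
                [(1, "0000A1", 4096), (2, "0000B2", 1024)])
con.executemany("INSERT INTO t_tags VALUES (?, ?)",
                [(1, "t3"), (2, "cms")])

# A 2.13-era query would have joined on ipnfsid; a 2.15-era one joins on inumber.
rows = con.execute(
    "SELECT i.ipnfsid, i.isize, t.tag "
    "FROM t_inodes i JOIN t_tags t ON t.inumber = i.inumber "
    "ORDER BY i.inumber").fetchall()
print(rows)  # [('0000A1', 4096, 't3'), ('0000B2', 1024, 'cms')]
```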
- Need to distinguish between NFSv4 interactive traffic ( users ) and batch system traffic ( jobs )
- Lately we've added ~500 CPU cores and now the jobs sometimes make our NFSv4 slow (>1s delay); interactive users are affected. We need to assign more priority to certain IPs, maybe via shorewall on the NFSv4 server? If you have hints or experiences we'll be glad to listen to you. OS is CentOS7
- Accounting numbers (from scheduler) from last month
UNIBE-LHEP
Operations
- stable, no incidents to report
ATLAS specific operations
- HC online ~90% (last month). Still room for improvement, but not too big an impact since interruptions are not long enough to cause the site to drain. UNIBE-ID >96%
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE
- 52% of ATLAS/CH WT, 54% CPUtime in April
- No progress on DPM head node migration to SLC6 and ATLAS storage dumps
- Accounting numbers (from scheduler) from last month (Mar 2016) ( includes ce03/CLOUD )
- WC h: 1028684 (ATLAS) - 149450 (t2k.org) - 16739 (uboone) - 10776 (uboone) - 11 (ops)
- Accounting numbers (from ATLAS dashboard) from last month (Mar 2016)
- CPU h: 967738
- WC h: 1108219
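From the ATLAS dashboard numbers above, CPU over wallclock gives the CPU efficiency for the month, roughly 87%. A quick check (values copied from the bullets above):

```python
# CPU efficiency from the ATLAS dashboard accounting (Mar 2016, as above).
cpu_h = 967738    # CPU hours
wc_h = 1108219    # wallclock hours

efficiency = cpu_h / wc_h
print(f"CPU efficiency: {efficiency:.1%}")  # ~87.3%
```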
UNIBE-ID
- Smooth operation the last weeks, except
- a lot of CERN jobs get killed due to h_vmem limit violation
- known issue: GridEngine counting issues with shared libraries
- no patch for (O)SGE available
- was no problem in the past (xxxx - 2015)
- no solution known (except moving to SLURM - not possible ATM); running 32-bit? only sgemaster? Experiences?
- Procurement:
- 76 new compute nodes (E5v4-10C@2.2GHz) ordered; they will be delivered on the 9th and 10th of June
- Doubling the IB spine switches => recabling of the whole IB fabric
UNIGE
- Operations:
- Running smoothly, with higher user usage of the cluster over the last months
- 2 NFS file servers for the DPM SE had RAID controller damage: both were repaired and came back into production
- ATLAS production jobs stopped since last Sunday: contacted Gianfranco to ask about it
- Some problems with Data Management at some sites, but we were removed and not put back until today
- A long list of action items:
- Such as CentOS,
- add WNs into the batch system,
- add new NFS File Server for ATLAS DPM SE,
- create a new pool inside DPM SE for another group: DAMPE
- Network upgrade to 10 Gb/s
- Move to SLURM for batch system and Puppet for DM SE
- Accounting numbers (from scheduler) from last month
NGI_CH
* Xxx
* EGI mostly focussing on Fed Cloud operation consolidation
* New MW products for CentOS 7:
- ARC
- Argus
- dCache
* NGI-CH Open Tickets review: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
- No ticket left untouched for 1 week or more
Other topics
Next meeting date:
NOTE: Week of 30th May is the Nordugrid conference (GS not available)
A.O.B.
Attendants
- CSCS:
- CMS: Fabio Martinelli
- ATLAS: Apologies (Gianfranco)
- LHCb:
- EGI: Apologies (Gianfranco)
Action items
- dcache 9k:
- e5600perf:
- sancscs: