
Swiss Grid Operations Meeting on 2016-05-12 at 14:00

Site status

CSCS

SYSTEM

  • IB eth bridges replaced
  • Accounting numbers (from scheduler) from last month

  • A few IB QDR switches replaced with FDR switches
  • Compute nodes re-installed
    • CSCS Central puppet
    • CSCS LDAP for users
    • CSCS NFS for home
  • ARC CE freshly installed with the following queues:
    • arc01: 64 nodes (Sandy Bridge nodes, 64 GB RAM, 32 cores)
    • arc02: 48 nodes (Ivy Bridge nodes, 128 GB RAM, 40 cores)
    • arc03: 40 nodes (Haswell nodes, 128 GB RAM, 48 cores), soon to be updated to v4
  • CREAM01/02/03: reviewing accounting before final shutdown
  • All Virtual machines are running on CSCS central VMware
  • CMS re-installation to be planned with Puppet base installation (Firewall, Users, Grid Certificates, ..)
  • Current allocation over 90%
    • Allocation problems on the old 64GB RAM nodes (arc01 queue)
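The ARC queue figures above can be turned into totals and a RAM-per-core ratio. This is a minimal sketch, assuming the quoted RAM and core counts are per node; `QUEUES` and `queue_summary` are illustrative names, not anything used at CSCS. The lower RAM-per-core ratio on arc01 may be related to the allocation problems noted above, but the minutes do not state the cause.

```python
# Queue figures as quoted in the minutes: (nodes, RAM per node in GB, cores per node).
# Assumption: RAM and core counts are per node.
QUEUES = {
    "arc01": (64, 64, 32),
    "arc02": (48, 128, 40),
    "arc03": (40, 128, 48),
}


def queue_summary(queues):
    """Return {queue: (total_cores, GB of RAM per core)} for each queue."""
    return {
        name: (nodes * cores, ram_gb / cores)
        for name, (nodes, ram_gb, cores) in queues.items()
    }


summary = queue_summary(QUEUES)
total_cores = sum(cores for cores, _ in summary.values())

for name, (cores, gb_per_core) in summary.items():
    print(f"{name}: {cores} cores, {gb_per_core:.2f} GB RAM per core")
print(f"total: {total_cores} cores")
```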
STORAGE
  • GPFS
    • A few weeks ago we reached 310M used inodes on the scratch fileset
    • High server load -> slow cleaning policy -> job problems
    • The filesystem stayed online
Consequence
    • per user inode quota (50M)
    • inode usage alerts
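
A per-user inode alert along the lines of the two consequences above could look like the following. This is only a sketch: the 50M quota is from the minutes, while the 80% warning threshold, the function name and the sample usage numbers are hypothetical, and the real usage figures would have to come from the site's own quota reporting.

```python
USER_INODE_QUOTA = 50_000_000  # per-user inode quota quoted in the minutes


def inode_alert(user, used_inodes, quota=USER_INODE_QUOTA, warn_fraction=0.8):
    """Return an alert string when usage reaches the warning fraction, else None.

    warn_fraction is a hypothetical threshold, not a value from the minutes.
    """
    fraction = used_inodes / quota
    if fraction >= 1.0:
        return f"CRITICAL {user}: {used_inodes} inodes, quota of {quota} reached"
    if fraction >= warn_fraction:
        return f"WARNING {user}: {fraction:.0%} of inode quota used"
    return None


# Hypothetical usage figures, for illustration only.
sample_usage = {"alice": 12_000_000, "bob": 45_000_000, "carol": 51_000_000}
for user, used in sample_usage.items():
    message = inode_alert(user, used)
    if message:
        print(message)
```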

  • dCache
    • 9K+ active connections (record?)
    • 12+ Gbit/s of internet bandwidth measured on the network (2x 6+ Gbit/s)
    • The real limit was about 2x 8 Gbit/s
    • The new limit is about 80 Gbit/s with the new gateways
[Figure: dCache connections]
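
To put the dCache bandwidth numbers in proportion, the measured peak can be compared with the old and new limits. A small sketch, taking the quoted "6+" figure as a lower bound of 6 Gbit/s per link; the variable and function names are illustrative.

```python
def utilisation(measured_gbit, limit_gbit):
    """Fraction of the available bandwidth that was in use."""
    return measured_gbit / limit_gbit


measured = 2 * 6   # "2x 6+ Gbit" measured, taken as a lower bound
old_limit = 2 * 8  # "about 2x 8 Gbit"
new_limit = 80     # "about 80 Gbit with the new gateways"

print(f"against the old limit: {utilisation(measured, old_limit):.0%}")
print(f"against the new limit: {utilisation(measured, new_limit):.0%}")
print(f"headroom factor gained: {new_limit / old_limit:.0f}x")
```

Under these assumptions the measured peak was at least three quarters of the old limit, which suggests the old gateways were close to saturation.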

Some technical details and numbers for the new storage that will become available in the coming days (1 PB)

  1. NETAPP (0.5PB)
    2xController / 4xFC16 Links / 10xLUNs / 12x6TB drives per LUN (RAID6)

    Architecture
    [Figure: CSCS SAN]

    Performance (6x dd from 2 servers, each on a different LUN, 3x controller)
    [Figure: E5600 performance]

  2. DDN SFA12K (0.5PB)
    2x Controller / 4x Storage Processors / 4x FC16 Links (1xStorage Processor) / 24LUNs / 10x3TB drives per LUN (RAID6)

    Architecture -> same CSCS integration
    Performance figures will follow ASAP
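
The quoted LUN layouts are consistent with roughly 0.5 PB per system. A minimal sketch of the arithmetic, assuming each RAID6 LUN loses two drives to parity and ignoring hot spares and filesystem overhead (neither is given in the minutes); the function names are illustrative.

```python
def raid6_usable_tb(luns, drives_per_lun, drive_tb, parity_drives=2):
    """Usable capacity in decimal TB: RAID6 keeps two drives per LUN for parity."""
    return luns * (drives_per_lun - parity_drives) * drive_tb


def tb_to_pib(tb):
    """Convert decimal terabytes to binary pebibytes."""
    return tb * 1e12 / 2**50


netapp_tb = raid6_usable_tb(luns=10, drives_per_lun=12, drive_tb=6)
ddn_tb = raid6_usable_tb(luns=24, drives_per_lun=10, drive_tb=3)

print(f"NETAPP: {netapp_tb} TB usable, about {tb_to_pib(netapp_tb):.2f} PiB")
print(f"DDN:    {ddn_tb} TB usable, about {tb_to_pib(ddn_tb):.2f} PiB")
```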

PSI

  • T3 upgraded to 10 Gb/s
    • The local network team deployed a Cisco extender with 32 CAT6 ports at 10 Gb/s and 8 fibre uplinks at 10 Gb/s; so far 4 of the 8 uplinks are cabled
    • The 32 ports can be organized into LACP groups; we made 5 groups of 2 ports each (10 ports cabled in total); both the dCache storage servers and the ZFS/NFSv4 NAS are connected to these LACP groups
    • Another 9 ports are used for the latest 10 Gb/s WNs (~500 CPU cores)
    • Generally speaking, it was a smooth 1 Gb/s to 10 Gb/s transition without unexpected trouble
  • dCache upgraded from 2.13 to 2.15
    • The 2.13 configuration files also worked in 2.15, nice
    • But the Chimera tables changed; there is a new field, inumber, that is now used as a key in several places where ipnfsid was used before (ipnfsid is still in the tables, though)
    • The Chimera table change means that both chimera-dump and my materialized view v_pnfs don't work right now
    • I won't update chimera-dump, but I am updating v_pnfs to be 2.15 compatible; CSCS will face the same issues on its own dCache roadmap
  • Need to distinguish between NFSv4 interactive traffic (users) and batch system traffic (jobs)
    • Since we added ~500 CPU cores, the hundreds of jobs sometimes make our NFSv4 slow (>1 s delay) and the users working interactively on the UIs are affected; we need to give these UIs a higher 'NFSv4 priority' than the WNs, for instance by configuring shorewall on the NFSv4 server? If you have already solved a similar issue, kindly contact us
  • Accounting numbers (from scheduler) from last month

UNIBE-LHEP

Operations

  • stable, no incidents to report
ATLAS specific operations
  • HC online ~90% (last month). There is still room for improvement, but the impact is limited since interruptions are not long enough to cause the site to drain. UNIBE-ID >96%
http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562&view=Shifter%20view#time=720&start_date=&end_date=&use_downtimes=false&merge_colors=false&sites=multiple&clouds=ND&site=UNIBE-LHEP,UNIBE-LHEP-UBELIX,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_CLOUD,UNIBE-LHEP_CLOUD_MCORE,UNIBE-LHEP_MCORE


  • 52% of ATLAS/CH WT, 54% CPUtime in April
  • No progress on DPM head node migration to SLC6 and ATLAS storage dumps

  • Accounting numbers (from scheduler) from last month (Apr 2016) ( includes ce03/CLOUD )
    • WC h: 1028684 (ATLAS) - 149450 (t2k.org) - 10776 (uboone) - 11 (ops)
  • Accounting numbers (from ATLAS dashboard) from last month (Apr 2016)
    • CPU h: 967738
    • WC h: 1108219
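
The accounting figures above can be combined into a total, a VO share and a CPU efficiency. A minimal sketch using only the numbers quoted in the minutes; the variable names are illustrative, and CPU efficiency is taken here as CPU hours divided by wall-clock hours from the ATLAS dashboard.

```python
# Wall-clock hours per VO from the scheduler (Apr 2016, includes ce03/CLOUD).
scheduler_wc_hours = {
    "ATLAS": 1028684,
    "t2k.org": 149450,
    "uboone": 10776,
    "ops": 11,
}

# Figures from the ATLAS dashboard for the same month.
dashboard_cpu_hours = 967738
dashboard_wc_hours = 1108219

total_wc = sum(scheduler_wc_hours.values())
atlas_share = scheduler_wc_hours["ATLAS"] / total_wc
cpu_efficiency = dashboard_cpu_hours / dashboard_wc_hours

print(f"total wall-clock hours (scheduler): {total_wc}")
print(f"ATLAS share of wall-clock hours:    {atlas_share:.1%}")
print(f"ATLAS CPU efficiency (dashboard):   {cpu_efficiency:.1%}")
```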

UNIBE-ID

  • Smooth operation over the last weeks, except:
    • a lot of CERN jobs get killed due to h_vmem limit violations
      • known issue: gridengine counting issues with shared libraries
      • no patch for (O)SGE available
      • was not a problem in the past (xxxx - 2015)
      • no solution known (except moving to SLURM, which is not possible ATM); running 32-bit? only sgemaster? Experiences?
  • Procurement:
    • 76 new compute nodes (E5v4-10C@2.2GHz) ordered, to be delivered on 9th and 10th of June
    • doubling the IB spine switches => recabling of the whole IB infrastructure

UNIGE

  • Operations:
    • Running smoothly, with higher user usage of the cluster over the last months
    • 2 NFS file servers for the DPM SE had RAID controller damage: both were changed and came back into production
    • ATLAS production jobs stopped since last Sunday: contacted Gianfranco to ask about it
      • There were some problems with Data Management at some sites, but we were removed and not put back until today
    • A long list of action items, such as:
      • CentOS
      • add WNs to the batch system
      • add a new NFS file server for the ATLAS DPM SE
      • create a new pool inside the DPM SE for another group: DAMPE
      • network upgrade to 10 Gb/s
      • move to SLURM for the batch system and Puppet for the DM SE
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • EGI mostly focussing on Fed Cloud operation consolidation
  • New MW products for CentOS 7:
    • ARC
    • Argus
    • dCache
  • NGI-CH Open Tickets review: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
    • No ticket left untouched for 1 week or more

Other topics

  • Topic1
  • Topic2
Next meeting date:

NOTE: The NorduGrid conference takes place in the week of 30th May (GS not available)

A.O.B.

Attendants

  • CSCS:
  • CMS: Fabio Martinelli
  • ATLAS: Apologies (Gianfranco)
  • LHCb:
  • EGI: Apologies (Gianfranco)

Action items

  • Item1
Topic attachments
  • May12_CMS.pdf (r1, 1650.8 K, 2016-05-12 12:00, JoosepPata): CMS report on computing resources
  • UniGe_GRID_last_month.png (r1, 41.6 K, 2016-05-12 12:12, LuisMarch): UniGe GRID last month
  • UniGe_Users_last_month.png (r1, 33.3 K, 2016-05-12 12:13, LuisMarch): UniGe Users last month
  • e5600two.png (r1, 56.7 K, 2016-05-12 10:08, DarioPetrusic): e5600perf
  • g07.201603.log (r1, 1.2 K, 2016-05-12 12:09, LuisMarch): UniGe accounting - March 2016
  • g07.201604.log (r1, 1.1 K, 2016-05-12 12:10, LuisMarch): UniGe accounting - April 2016
  • phoenix-cscs-san.png (r1, 95.1 K, 2016-05-12 10:08, DarioPetrusic): sancscs
  • poolqueueplots2.png (r1, 24.8 K, 2016-05-12 10:04, DarioPetrusic): dcache 9k
Topic revision: r14 - 2016-06-01 - GianfrancoSciacca
 