
Swiss Grid Operations Meeting on 2016-01-14

Site status

CSCS

  • Compute
    • Confirmed order of 40 WN
      • Details:
        • DALCO r2264i5t 2U Scalable Compute Module (4 nodes in one 2U chassis) configuration with:
        • Each node supports:
          • 2 x 14 Core Intel Xeon E5-2680 v4 2.4GHz Processor (~1Q2016)
          • 128GB 2133MHz DDR4 ECC Reg. Server Memory
          • 1 x 120GB Intel SSD Drive
          • 2 x Intel 10/100/1000 Gigabit-Ethernet onboard
          • 1 x ETH0/NIC1 LAN Interface for IPMI & KVM over IP
          • 1 x Mellanox ConnectX-3 FDR Dual-Port Adapter
          • Two High-Efficiency 1600Watt hot-plug redundant power supplies
    • As discussed with Gianfranco, we disabled the reboot request issued every ~60 days on the WNs, to avoid leaving nodes idle
    • Updated all WNs to SL 6.7 and updated their packages
    • Started working on the new Python version of sltop, using the pyslurm Python module
    • Fixing some small issues with Puppet-managed user creation on the WNs
    • Some HP nodes are down
  • Accounting numbers (from scheduler) from last month (slurm report)
  • Storage
    • New hardware:
      • 4x NetApp E5660 (500 TB total) have arrived; a further 500 TB from the CSCS SAN will be ready soon
      • 4x Lenovo M5 servers: delivery scheduled for next week
  • GPFS
    • No issues or news
  • DCACHE
    • Upgrade to v2.10 has been completed
    • Some issues still remain to be fixed
    • Dumps for ATLAS will work starting this month (host mapping from atlas01)
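
As a quick sanity check on the worker node order listed under Compute, the aggregate capacity can be worked out from the per-node figures. This is only a sketch: the function name `order_totals` is made up for illustration, and it counts physical cores only (no hyper-threading), which is an assumption rather than something stated in the minutes.

```python
# Per-node figures copied from the order details under "Compute".
NODES = 40                # confirmed order of 40 WN
NODES_PER_CHASSIS = 4     # 4 nodes in one 2U chassis
SOCKETS_PER_NODE = 2      # 2 x Intel Xeon E5-2680 v4
CORES_PER_SOCKET = 14     # 14 physical cores each (hyper-threading not counted)
MEMORY_GB_PER_NODE = 128  # 128GB DDR4 per node


def order_totals(nodes=NODES):
    """Return aggregate figures for an order of `nodes` worker nodes."""
    cores_per_node = SOCKETS_PER_NODE * CORES_PER_SOCKET
    return {
        # Round up, in case the node count is not a multiple of 4.
        "chassis": -(-nodes // NODES_PER_CHASSIS),
        "cores": nodes * cores_per_node,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
        "memory_gb_per_core": MEMORY_GB_PER_NODE / cores_per_node,
    }


totals = order_totals()
print(totals)
```

With the figures above this gives 10 chassis, 1120 physical cores and 5120 GB of memory, i.e. roughly 4.6 GB per core.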

PSI

  • Boot issues with HP Smart Array P440 Controller in HBA mode
    • Hit boot issues on CentOS7 with the HP Smart Array P440 Controller configured in HBA mode; had to manually upgrade the controller driver to hpsa-3.4.12. Read about these driver update cases here. Usually you would not want to do that, but it was needed in this case
  • Nagios and HP HW
  • ZFS On CentOS7
    • Satisfied so far; be aware of the new Jan'16 release 0.6.5.4 (the closed issues are listed here)
    • Still not in production, though; there is a lot that is new in CentOS7 that I want to learn thoroughly, and the Christmas break didn't help
  • Finally new HW at PSI in 2016

UNIBE-LHEP

Operations

  • ce02 re-installed with ROCKS 6.2, SLC6.7 and SLURM 15.08.04 (320 worker cores still to be installed)
  • ce01 has been operating with SLURM 15.08.01 for two and a half months and is very stable (256 new cores awaiting delivery)
  • 1980 cores in total once the installation is finalised
  • Finalised the integration of the Microboone VO (OSG - Fermilab) on both clusters
  • New CE (ce04) fronting a SLURM cluster running on the SWITCHEngines cloud infrastructure
    • Elasticluster on OpenStack
    • Integrated in Panda ( UNIBE-LHEP_CLOUD )
    • Currently commissioning with HammerCloud
ATLAS specific operations
  • Very stable running over the end-of-year break
  • Monthly dumps of the namespace on the DPM SE still depend on re-deployment of the DPM head node on SLC6 (moving from yaim to puppet for configuration)

  • Accounting numbers (from scheduler) from last month
    • Incomplete+provisional (ce02 missing)
    • CPU h: 410292 (ATLAS) - 239 (t2k.org) - 572 (uboone) - 18 (ops)
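
To put the provisional CPU-hour figures above in proportion, the per-VO shares can be computed as in the following sketch. The numbers are the ones quoted in the minutes (ce02 is missing, so they are incomplete); the dictionary layout and variable names are only illustrative.

```python
# Provisional CPU hours per VO as quoted in the minutes (ce02 not included).
cpu_hours = {
    "ATLAS": 410292,
    "t2k.org": 239,
    "uboone": 572,  # MicroBooNE VO
    "ops": 18,
}

total = sum(cpu_hours.values())

# Share of each VO as a percentage of the total.
shares = {vo: 100.0 * hours / total for vo, hours in cpu_hours.items()}

print(f"total CPU h: {total}")
for vo, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{vo:>8}: {cpu_hours[vo]:>7} h  ({share:.2f}%)")
```

The total comes to 411121 CPU hours, of which ATLAS accounts for about 99.8%.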

UNIBE-ID

  • Xxx

UNIGE

Operations

  • Ran smoothly over the last month and the Christmas break (lower user activity on the cluster)
  • Outlook: We will install puppet for DPM (and probably also for cluster configuration and setup)
    • A testbed with a File Server (atlasfs29) and 1 PC for services (puppet)
  • Outlook: We are going to request around 3 GPUs (for image processing), following a user request
Network - Outlook
  • We are going to request a new 10 Gb/s network switch (network switch upgrade) for the cluster
    • Mainly aimed at the data management systems (DPM and NFS, mostly NFS)
  • The request will be submitted at the end of the month; if approved, we could have the switch in summer
Storage
  • Checking the data stored on the DPM SE (dump) for cleaning purposes, as requested by ATLAS
  • Dump not yet sent to ATLAS (scheduled to be done)
Accounting

NGI_CH

  • Reminder: SL5 services must be decommissioned by end of April 2016
    • Affected:
      • UNIBE-LHEP (DPM head node)
      • UNIGE-DPNC? (some DPM pool nodes?)
      • CSCS-LCG2? (some dCache pool nodes?)
  • Feedback invited: Distributing middleware as Docker images
    • Proposal: release UMD4 products as Docker images in addition to RPMs
    • Pros: can run directly on hardware, no virtualization platform needed
    • Cons: may be hard to create and maintain
    • To be provided by TPs and/or volunteer sites
    • Possible profiles: site/top BDII, CEs
  • NGI-CH Open Tickets: https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO
    • UNIBE-LHEP:
      • 118692 (http support on the DPM SE): still needs investigation
      • 117899 (storage dumps for ATLAS) on hold, needs DPM head on SLC6 first
    • CSCS-LCG2:
      • 118253 (decommissioning dCache 2.6) can be closed
      • 117786 (storage dumps for ATLAS)
    • UNIGE-DPNC:
      • 117900 (storage dumps for ATLAS)
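
For scripted checks, the open-tickets link above can be rebuilt from its query parameters. This is a minimal sketch: the parameter names and values are copied from the URL in the minutes, while the helper name `ggus_open_tickets_url` is hypothetical, and whether GGUS accepts other support unit values in exactly the same way is an assumption.

```python
from urllib.parse import urlencode

GGUS_BASE = "https://ggus.eu/index.php"


def ggus_open_tickets_url(support_unit="NGI_CH"):
    """Build the GGUS search URL for open tickets of one support unit."""
    # Parameter names, values and ordering mirror the link in the minutes.
    params = {
        "mode": "ticket_search",
        "supportunit": support_unit,
        "status": "open",
        "timeframe": "any",
        "orderticketsby": "REQUEST_ID",
        "orderhow": "desc",
        "search_submit": "GO",
    }
    return f"{GGUS_BASE}?{urlencode(params)}"


url = ggus_open_tickets_url()
print(url)
```

Keeping the parameters in a dictionary makes it easy to change a single field (for example the sort order) without editing the long URL by hand.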

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS: Fabio, Joosep
  • ATLAS: Gianfranco
  • LHCb: Roland
  • EGI: Gianfranco

Action items

  • Item1
Topic attachments
  • Accounting_last_month.png (r1, 39.7 K, 2016-01-14 13:45, LuisMarch): UniGe accounting (last month)
Topic revision: r14 - 2016-01-22 - FabioMartinelli