<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

Swiss Grid Operations Meeting on 2019-03-07 at 14:00

Site status

CSCS

Systems

  • Phoenix: all nodes idle, load moved to Daint

  • ARC CEs reinstalled on new HW; only arc04 is still missing, and its reinstallation is scheduled for Monday

  • Testing new squid and scheduled reinstallation of cvmfs and cvmfs1

  • Old compute nodes distribution?

    • LHCb wants 2 chassis with 4 nodes each
    • ATLAS will take around 100 nodes
Storage
dCache
  • normal operation
  • preparing lab for new design + upgrade
  • complete storage migration will start in April
GPFS
  • normal operation
  • Upgraded firmware on DELL SC9000 (SSD tier1)
  • next week: rolling / online network interface replacement (from IB to Ethernet)
  • upgrade to GPFS 5.0.2
  • Slow tier migration in April from DDN SFA12k to DELL SC9000

PSI

UNIBE-LHEP

  • Ramping down LHEP in view of the cluster re-deployment

  • Monthly summary: pledged 18 kHS06, delivered 18 kHS06
  • Ubelix contributing >50% (23% typical)
  • Running on average >1850 slots (2500 typical)



  • 6-month history UniBE (pledge: 18 kHS06)



  • Accounting numbers (from scheduler) from last month, LHEP only
    • Omitted this month

Swiss ATLAS statistics

  • Hammercloud availability:


    • ANALY_CSCS-HPC: 95%
    • CSCS-LCG2-HPC_MCORE: 94.5%
    • UNIBE-* : 100%

  • Running slots
    • A large number of stuck jobs on ARC skews the statistics for CSCS, causing problems with reporting to ATLAS
    • Very likely caused by the reported issues with the Daint scratch file system, which affect WLCG jobs in some way

    • This required a laborious manual clean up:
      • job list provided by ATLAS, culled from the aCT
      • manual cleanup of the ARC sessiondir carried out by Miguel
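The manual cleanup described above could be prepared with a dry-run helper along the following lines. This is only a sketch under stated assumptions: it assumes that each job has its own directory directly under the ARC session directory, named after the job ID, and that the list provided by ATLAS is a plain text file with one job ID per line. The function names and the file format are hypothetical. Nothing is deleted; the helper only reports which listed jobs still have a directory.

```python
from pathlib import Path


def read_job_ids(list_file):
    """Read one job ID per line, skipping blank lines and '#' comments."""
    ids = []
    for line in Path(list_file).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def find_session_dirs(sessiondir, job_ids):
    """Split job IDs into (existing job directories, IDs with no directory).

    Assumption: the layout is <sessiondir>/<job_id>/, which may not match
    the actual ARC configuration at a given site.
    """
    root = Path(sessiondir)
    found, missing = [], []
    for job_id in job_ids:
        candidate = root / job_id
        if candidate.is_dir():
            found.append(candidate)
        else:
            missing.append(job_id)
    return found, missing
```

Calling `find_session_dirs(sessiondir, read_job_ids(list_file))` returns the directories to review by hand before any removal; keeping the removal step manual is deliberate, since the job list and the directory layout are both assumptions here.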




  • Accounting Numbers from the ATLAS dashboard (February 2019) CSCS+UNIBE
| Cluster | Job Type | Produced WC core-hours | Good vs Bad WC % | CPU eff good jobs % |
| CSCS | Any | 3'088'664; 72% | 0.52 | 0.70 |
| UniBe | Any | 1'149'338; 28% | 0.75 | 0.75 |





  • Take home lessons from the last month:
    • Failed WC very high; we need more real-time alerts
    • Public dashboard replica offline for a while
    • ATLAS now relies fully on ARC services:
      • we need ARC metrics and/or logs
      • Dino has some ideas, but implementing them is not a short-term affair
      • what can we do in the meantime?
      • we need monitoring and automated checks (Nagios, Graylog?) on ARC services

      • Please report general Daint issues that could affect WLCG jobs (SLACK-general or email) so that we can react if needed
    • ...
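As an interim measure for the automated checks on ARC services mentioned above, a Nagios-style probe could look roughly like the sketch below. It is illustrative only: the function names are hypothetical, and the host name and port numbers shown in the comment (2811 for the GridFTP interface, 2135 for the LDAP information system) are assumptions about the deployment rather than anything stated in these minutes. The only convention relied on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp(host, port, timeout=5.0):
    """Return (exit_code, message) for a single TCP connect attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        # Covers refused connections, timeouts and name resolution errors.
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"


def check_services(host, ports, timeout=5.0):
    """Check several ports; the overall status is the worst single result."""
    results = [check_tcp(host, port, timeout) for port in ports]
    worst = max(code for code, _ in results)
    return worst, "; ".join(msg for _, msg in results)


# Hypothetical usage from a wrapper script (host and ports are assumptions):
#   code, msg = check_services("arc-ce.example.org", [2811, 2135])
#   print(msg)
#   raise SystemExit(code)
```

A successful TCP connect only shows that a port is open, not that jobs are being processed, so a probe like this would complement rather than replace the ARC metrics and logs asked for above.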

UNIBE-ID

  • End of Feb: Reinstallation of ARC CE on nordugrid.unibe.ch AKA UNIBE LHEP_UBELIX:
    • Reason: EL6 -> EL7
    • smooth transition within ~4h
    • no issues after reinstallation

UNIGE

  • Xxx
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2

Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments
| Attachment | Revision | Size | Date | Who | Comment |
| HammerCloud.png | r1 | 453.1 K | 2019-03-07 - 09:18 | GianfrancoSciacca | ATLAS HammerCloud last month |
Topic revision: r5 - 2019-03-07 - DinoConciatore