<!-- keep this as a security measure:
* Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
* Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

Swiss Grid Operations Meeting on 2018-11-08 at 14:00

  • Place: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 10537598)
  • External link: https://vidyoportal.cern.ch/flex.html?roomdirect.html&key=FAEn4zjAba7BqoQ11TGZu66VSDE
  • Phone gate: From Switzerland: 0227671400 (portal) + 10537598 (extension) + # (pound sign)
  • IRC chat: irc:gridchat.cscs.ch:994#lcg (ask pw via email)
  • Switch Vidyo SIP IP: 137.138.248.204

Site status

CSCS

PSI

UNIBE-LHEP

  • A bit less stable (lack of manpower), lower delivery for a few months, still fulfilling the pledge.
  • Ubelix dropped out silently on 10 October
  • Running on average <1900 slots (typical 2500), Ubelix contribution 12% (typical 23%)
  • Large t2k.org run in September, 1 cluster reserved for a local user for almost the entire month

  • Accounting numbers (from scheduler) from last month (October), LHEP only

    VO       Job Type  Produced WC core-hours
    ATLAS    Any       1157991
    ops      Any       44
    t2k.org  Any       0
    uboone   Any       0


  • Five month history Unibe (pledge: 18 kHS06)
  • Swiss ATLAS statistics
    • HC availability [1]:
      • CSCS-LCG2: 95% Prod, 97% Analy
      • CSCS-LCG2-HPC: 75% Prod, 76% Analy
      • UNIBE-LHEP: 99% Prod, 96% Analy
      • UNIBE-LHEP-UBELIX: 100% ($) Prod, 27% Analy

        ($) effectively up only ~30%

    • CSCS running 3300 slots on average, UNIBE running 1850
    • Accounting numbers (from dashboard) from last month for CSCS and UNIBE

Cluster  Job Type  Produced WC core-hours  Good vs Bad WC %  CPU eff good jobs %
CSCS     Any       2901550 (69%)           0.71              0.89
Unibe    Any       1266896 (31%)           0.85              0.85






[1] http://dashb-atlas-ssb.cern.ch/dashboard/request.py/siteviewhistorywithstatistics?columnid=562#time=custom&start_date=2018-10-01&end_date=2018-10-31&use_downtimes=false&merge_colors=false&sites=multiple&clouds=all&site=ANALY_CSCS,ANALY_CSCS-HPC,ANALY_UNIBE-LHEP,ANALY_UNIBE-LHEP-UBELIX,CSCS-LCG2-HPC_MCORE,CSCS-LCG2_MCORE,UNIBE-LHEP-UBELIX_MCORE,UNIBE-LHEP_MCORE

UNIBE-ID

  • Enabled EGI ARGO notification e-mails in GOCDB to respond to CE stalling silently
  • Opportunistic usage on Ubelix to be added as soon as the SL6 legacy partition is discontinued
    • Slurm preemptable partition
    • ATLAS can use idle slots
    • ATLAS jobs killed (not checkpointed) when slots needed by other users

UNIGE

  • Re-commissioning of ARC CE delayed
  • Distributed DPM storage working well

NGI_CH

  • Our deal with EGI for certificates expires in March 2019
    • Science IT support Bern is looking into what the alternative will be

  • NGI-CH open tickets review

Ticket-ID Type VO Site Priority Resp. Unit Status Last Update Subject Scope
131965   none UNIBE-LHEP less urgent NGI_CH assigned on hold 2018-10-04 IPv6 deployment at WLCG Tier-2 sites EGI
133695 lhcb CSCS-LCG2 urgent NGI_CH assigned in progress 2018-10-19 Data access problem at CSCS-LCG2 WLCG
132927   cms CSCS-LCG2 urgent NGI_CH assigned involved in progress 2018-11-12 Problem with APEL Accounting for all of ... EGI
131948   none CSCS-LCG2 less urgent NGI_CH assigned in progress 2018-11-13 IPv6 deployment at WLCG Tier-2 sites EGI
138296   cms CSCS-LCG2 urgent NGI_CH assigned 2018-11-14 Transfers failing from T2_CH_CSCS WLCG
138314 atlas CSCS-LCG2 less urgent NGI_CH assigned 2018-11-15 DE CSCS-LCG2 : transfer failures with ... WLCG

Other topics

  • Follow up to fair-share meeting

  • Two questions, one for the slurm experts, one for the VO reps:
    • Is Slurm charging the reserved time or the elapsed*cores time to the user fair-share?
      • NICK: no, not the reserved time; it is using (endtime-starttime)*cores

    • Possible mitigation: pack single-core jobs onto nodes, as opposed to distributing them across all nodes. How does this sound?
      • this should reduce node fragmentation and give the MC jobs more opportunities to run in a timely manner
        • NICK: cannot comment at the moment, will look at it
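
To make the two questions above concrete, here is a minimal Python sketch. It is not CSCS's Slurm configuration. The first function applies the charge Nick quotes, (endtime-starttime)*cores. The second compares packing single-core jobs onto as few nodes as possible with spreading them across all nodes, and the third counts how many nodes could still start a multi-core job afterwards. All function names, the 4-node by 8-core cluster, and the job counts are assumptions chosen only for illustration.

```python
from datetime import datetime


def charged_core_seconds(start: datetime, end: datetime, cores: int) -> float:
    """Usage charged to fair-share as (endtime - starttime) * cores."""
    return (end - start).total_seconds() * cores


def place_single_core_jobs(n_jobs: int, n_nodes: int, cores_per_node: int,
                           pack: bool) -> list[int]:
    """Return the free cores left on each node after placing single-core jobs.

    pack=True fills one node completely before using the next one;
    pack=False places the jobs round-robin across all nodes.
    """
    if n_jobs > n_nodes * cores_per_node:
        raise ValueError("not enough cores for all jobs")
    free = [cores_per_node] * n_nodes
    for i in range(n_jobs):
        node = i // cores_per_node if pack else i % n_nodes
        free[node] -= 1
    return free


def nodes_fitting(free: list[int], cores_needed: int) -> int:
    """Count the nodes that still have room for a job needing cores_needed cores."""
    return sum(1 for f in free if f >= cores_needed)


if __name__ == "__main__":
    # A job that ran for 2 hours on 8 cores, whatever walltime it requested.
    start = datetime(2018, 11, 8, 12, 0)
    end = datetime(2018, 11, 8, 14, 0)
    print(charged_core_seconds(start, end, 8) / 3600, "core-hours charged")

    # Hypothetical cluster: 4 nodes x 8 cores, 8 single-core jobs to place.
    for pack in (False, True):
        free = place_single_core_jobs(8, 4, 8, pack=pack)
        label = "packed" if pack else "spread"
        print(label, free, "nodes free for an 8-core job:", nodes_fitting(free, 8))
```

In this toy case spreading leaves 6 free cores on every node, so no 8-core job can start, while packing fills one node and leaves three whole nodes free. A real scheduler also has to account for memory, job durations and backfill, which this sketch ignores.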

  • Other possible mitigations, to be discussed internally between the VOs, need input from CSCS:
    • Distribution of job queue waiting time over the last two quarters, split by: Daint vs Phoenix, VO, and 8-core vs 1-core (we should exclude the T0 jobs from these plots)
      • NICK: CSCS will investigate providing queue wait time reporting
    • Anything else?
      • NICK: Move forward with Stefano’s recommendation on Tuesday for a face-to-face meeting, preferably before the end of the year
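
As a rough illustration of the queue wait time report requested above, the following Python sketch groups wait times by cluster, VO and core count and skips T0 jobs. The `Job` record, its field names and the `is_t0` flag are hypothetical. How CSCS would extract such records from its accounting is not specified in these minutes.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import NamedTuple


class Job(NamedTuple):
    cluster: str          # e.g. "Daint" or "Phoenix"
    vo: str               # e.g. "atlas", "cms", "lhcb"
    cores: int            # 1 or 8
    submit: datetime      # time the job entered the queue
    start: datetime       # time the job started running
    is_t0: bool = False   # hypothetical flag marking T0 jobs


def wait_time_summary(jobs):
    """Summarise queue wait times in hours per (cluster, VO, cores), excluding T0 jobs."""
    waits = defaultdict(list)
    for job in jobs:
        if job.is_t0:
            continue
        hours = (job.start - job.submit).total_seconds() / 3600
        waits[(job.cluster, job.vo, job.cores)].append(hours)
    return {
        key: {"jobs": len(values), "median_h": median(values), "max_h": max(values)}
        for key, values in sorted(waits.items())
    }
```

Calling `wait_time_summary(jobs)` returns one entry per (cluster, VO, cores) combination. The per-group lists of wait times could equally be passed to a plotting tool to show the full distribution rather than just the median and maximum.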

  • Can we agree that the Daint and Phoenix shares (30 or 60 day historical view) will be monitored monthly at this meeting?
    • GIANFRANCO: not discussed

  • Topic2
    ...

Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1
Topic revision: r8 - 2018-11-16 - GianfrancoSciacca