Swiss Grid Operations Meeting on 2021-02-11 at 15:00

Next meeting: 11 March 2021 at 14h00


Minutes:

Mauro summarizing various conversations that happened in the past month:

  • Monthly checkpoint / Ops meeting / Slack
    • meetings are for reporting summaries and checkpointing performance, in order to flag possible deviations from the expected norm
    • everybody is encouraged to use the Slack channel for daily operations

  • QUOVADIS:
    • WLCG took until Monday afternoon to send the right certificates. They sent the wrong certificates over the weekend
    • Nick reminds everyone that the contract CHIPP has with CSCS does not cover after hours and weekends

      [Clarification from Gianfranco]: WLCG has nothing to do with the certificate distribution. The corrected certificate distribution was made preliminarily available by EGI/IGTF to the sites within a few hours of the incident on Friday. Please refer to the incident post-mortem circulated via e-mail for a full summary and a timeline of the incident and response.

  • Monitoring plots in Gianfranco summaries:
    • The red line is automatically drawn at 64k (CSCS) + 42k (Bern) = 106k
    • the numbers reported in the slides' text are computed from the average (from the plots) multiplied by #days

  • DPM:
    • (from Gianfranco's slides last month) The migration of CSCS ATLAS dCache to DPM still needs more discussion, mainly with ATLAS DDM, on the best strategy to achieve it
    • Nick: what is the timescale for the decision?

      [Added by Gianfranco]: we foresee a communication to CSCS by end of Q1 2021

  • Storage deployment:
    • The CMS computing coordinator asked when the pledges for 2021 will be deployed
    • We will follow the usual 1st April schedule, but for storage this depends on the hardware delivery
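
The arithmetic behind the monitoring-plot item above can be sketched as follows. This is only an illustration: the function names are hypothetical, the Bern figure is assumed to be 42k (consistent with the stated total of 106k), and the daily average used in the example is made up, not taken from the slides.

```python
# Site figures as quoted in the minutes (same units as the plots).
CSCS_LEVEL = 64_000
BERN_LEVEL = 42_000  # assumption: "42" in the minutes means 42k


def combined_level(cscs=CSCS_LEVEL, bern=BERN_LEVEL):
    """Level of the red line: the sum of the two site figures."""
    return cscs + bern


def period_total(plot_average, n_days):
    """Number quoted in the slide text: plot average multiplied by #days."""
    return plot_average * n_days


if __name__ == "__main__":
    print(combined_level())
    # Illustrative average only; the real value is read off the plots.
    print(period_total(100_000, 31))
```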

Nick:

  • Monthly numbers provided by Nick match the ones in Gianfranco's slides.
  • The cumulative delivery over the year is basically spot on the pledges, with the over-delivery of July/August compensating for the Sept/Jan under-delivery
  • CHIPP overall is still at 111%
  • EGI-SVG-2020-17013 has NOT been applied
    • “After the patch has been applied, the job discovery functionality for ldap+gridftp will be broken to some extent, but should keep working for ATLAS and CMS robot certificates.”
    • “Apart from this, the client tools will continue to fully work and other submission systems may be unaffected.”
    • This sounds like it will break workflows.

Nick's comments on Gianfranco's slides:

  • still unclear why ATLAS has been sending jobs to the small CE for the past months. This is now mitigated because they got merged

    [Comment by Gianfranco]: what is the "small CE"? ATLAS sends jobs to all CEs configured: arc04 and arc05. It has always been like this and is correct. The merging (that we have asked for) doesn't act as "mitigator" for anything, it results in improved operational efficiency for CSCS and ensures some level of service redundancy.

  • Cloud manager suggested to escalate to the International Computing Board - I proposed not to do that yet. But if things aren’t fixed, it will come
    • Who is the cloud manager ?

      [Comment by Gianfranco]: ATLAS internal organisation (cloud=>regional centre)

Derek :

  • Thursday morning spent with Dario/Elia on problems with roots on storage01:
    • Derek suggests meeting for 2h one afternoon to get together and re-sync our common knowledge of the system (a doodle will be sent out in ~a month)
    • distributed responsibility: the factorization of the responsibilities/understanding of the system can slow the debugging
  • T2 space filled up because of the TiB / TB mismatch: Now solved
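
As a reminder of the size of the TiB / TB mismatch mentioned above, here is a minimal conversion sketch. The helper names are hypothetical and the 1000 TB figure is purely illustrative, not the actual T2 space.

```python
TB = 10**12   # terabyte: decimal (SI) unit, in bytes
TIB = 2**40   # tebibyte: binary (IEC) unit, in bytes


def tb_to_tib(tb):
    """Convert decimal terabytes to binary tebibytes."""
    return tb * TB / TIB


def tib_to_tb(tib):
    """Convert binary tebibytes to decimal terabytes."""
    return tib * TIB / TB


if __name__ == "__main__":
    # A space sized as 1000 TB holds only about 909.49 TiB,
    # so reading one unit as the other is off by roughly 10%.
    print(round(tb_to_tib(1000), 2))
```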

Roland:

  • GGUS: Dino/Gianni will pick up the ticket from Roland. Vladimir should check
  • Singularity from LHCb is running in the payload. Can CSCS remove theirs and rely on the LHCb one?



Followup from previous Action Items

Action items

* Validation of January accounting data *

Direct link for data validation is:

https://wlcg-cric.cern.ch/wlcg/accdata/list/

You can find detailed instructions how to proceed on the twiki page:

https://twiki.cern.ch/twiki/bin/view/LCG/CRICDataValidationInstruction

We ask you to validate January data before the 22nd of February. January data will be blocked on the 22nd of February and accounting reports generated after that.

VO reports

ATLAS

CMS

LHCb

T2 Sites reports

CSCS

  • CHIPPreportJan2021.pdf: CSCS January Site Report

    HIGHLIGHTS:

    5th month in a row of under-delivery: 74% of pledge in January
    Delivered 61% of pledge over the last 5 months
    Cloud manager suggested to escalate to the International Computing Board
    I proposed not to do that yet. But if things aren’t fixed, it will come at some point (independently of me)

UniBe

T3 Sites reports

PSI

UniGe

EGI / WLCG

  • EGI accounting for CSCS: Issues with bad HEPSPEC configuration in ARC since July 2020
    • Wallclock not scaled by the HS coefficient for these months => results in much lower values in the APEL database:
      https://accounting.egi.eu/egi/ngi/NGI_CH/normelap_processors/SITE/DATE/2020/1/2021/1/egi/onlyinfrajobs/

    • Double-publishing (or stale records in the APEL DB) for August 2020
      https://accounting.egi.eu/egi/ngi/NGI_CH/elap_processors/SITE/DATE/2020/1/2021/1/egi/onlyinfrajobs/

    • The site should decide whether or not to republish to correct the values for these months
    • If yes, let me know whether I should follow up with a ticket to APEL or the site prefers to do that itself

  • QuoVadis intermediate certificate incident on 15th Jan:
    • QuoVadis rolled over their intermediate certificate but did not communicate to EGI for inclusion of the new cert in the EGI Trustanchors
    • This affected not only Grid operations but also everyone nationwide using SSL certificates from QV
    • The Bern CA, QV and EGI remedied the issue within 2h of it becoming known (GGUS tickets to sites about authentication issues), and the new certificate was pushed into the egi-upcoming.repo and made available to sites for installation
    • a detailed post-mortem with timeline of the incident and the remedies will be circulated to the lcg mailing list
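
To illustrate why the HEPSPEC configuration issue reported for CSCS above lowers the published values, here is a minimal sketch of wallclock normalisation. It assumes the usual convention that normalised time is wallclock time multiplied by the number of cores and a per-core HS06 coefficient. The function name and the coefficient of 10.0 are hypothetical and are not the CSCS values.

```python
def normalised_hours(wall_hours, n_cores, hs06_per_core):
    """Wallclock hours scaled by core count and the per-core HS06 coefficient."""
    return wall_hours * n_cores * hs06_per_core


if __name__ == "__main__":
    # One 8-core job running for 24 hours.
    scaled = normalised_hours(24, 8, 10.0)    # with a hypothetical coefficient
    unscaled = normalised_hours(24, 8, 1.0)   # coefficient effectively missing
    # The unscaled record is lower by exactly the coefficient.
    print(scaled, unscaled, scaled / unscaled)
```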

Review of open tickets

https://ggus.eu/index.php?mode=ticket_search&supportunit=NGI_CH&status=open&timeframe=any&orderticketsby=REQUEST_ID&orderhow=desc&search_submit=GO

7 of 7 Tickets
Ticket-ID Type VO Site Priority Resp. Unit Status Last Update Subject Scope
150587 lhcb CSCS-LCG2 top priority NGI_CH assigned 2021-02-11 Pilos Faild/Aborted at CSCS-LCG2 WLCG
150585   cms CSCS-LCG2 urgent NGI_CH in progress 2021-02-11 XRootD -read tests for one CE failiing ... WLCG
150484   cms CSCS-LCG2 less urgent NGI_CH in progress 2021-02-08 enabling AREX service at ARC-CE(s) at ... WLCG
150477   cms CSCS-LCG2 urgent NGI_CH in progress 2021-02-09 Transfers failing to T2_CH_CSCS WLCG
150373   dune UNIBE-LHEP less urgent NGI_CH assigned 2021-01-27 Enable DUNE queue for CPU and future ... EGI
149166   cms CSCS-LCG2 less urgent NGI_CH assigned assigned 2021-01-28 CVMFS squids at CSCS WLCG
144485   none CSCS-LCG2 less urgent NGI_CH assigned in progress 2020-12-18 Upgrade to recent dCache release EGI

a.o.b
  • Attendees
    • CSCS: Nick, Dario, Colin, Elia
    • CMS: Derek, Mauro
    • ATLAS:
    • LHCb: Roland
    • EGI:

Topic attachments
  • ATLAST2ReportJan2021.pdf (r1, 389.8 K, 2021-02-11 13:34, GianfrancoSciacca): ATLAS CH Tier2 report
  • CHIPPreportJan2021.pdf (r1, 1270.8 K, 2021-02-11 12:28, NickCardo): CSCS January Site Report
  • UnibeT2ReportJan2021.pdf (r1, 173.7 K, 2021-02-11 13:34, GianfrancoSciacca): UNIBE-LHEP T2 report
  • wlcg-swissops-cms-20210211.pdf (r3, 2083.6 K, 2021-02-11 14:16, DerekFeichtinger): CMS report

Topic revision: r7 - 2021-06-18 - MauroDonega
 