<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2019-12-05 at 14:00

Check calendar invitation for CSCS Zoom details. <br />%TOC%

---++ Action items

   * All VOs: identify whether jobs failing during a maintenance are accounted as failed or ignored.
   * All VOs: validate accounting data for Nov 2019 vs. CSCS accounting. %ICON{up}%
   * Miguel: produce an example command to pull accounting data off a Slurm cluster.
   * Nick: make VO utilisation charts available at each meeting.

---++ Site status

---+++ CSCS

   * [[%ATTACHURL%/CHIPPreportNov2019.pdf][CHIPPreportNov2019.pdf]]: CSCS November Report
   * Nick would like to identify whether jobs failing during a maintenance are accounted as failed for the site, or ignored. Action item on all VOs.
   * Christoph would like to see whether the VO utilisation charts can be made available at each meeting. Action item on Nick.
   * Christoph would like VOs to validate their accounting data for Nov 2019 against the CSCS accounting data. Action item on all VOs.
   * Derek needs the commands that pull accounting data off a Slurm cluster and produce the numbers shown in the slides available for the meeting. Action item on Miguel.
   * Response time during Christmas at CSCS will be limited, as the site will be closed. CSCS is putting in additional effort to make the system even more reliable.

---+++ PSI

   * [[http://t3mon.psi.ch/ganglia/PSIT3-custom/accounting.txt][Accounting numbers (from scheduler) from last month]]

---+++ UNIBE-LHEP

   * No report.

---+++ UNIBE-ID

   * Some job errors due to storage problems. The cause was bad IB cables, mechanically damaged during the server room reconstruction.
   * Some cables have been replaced; the rest will be replaced in the next downtime on 2019-12-12.
   * ARC CE otherwise running smoothly.

---+++ UNIGE

   * No report.

---+++ NGI_CH

   * Report on this ticket:<br />REFERENCE LINK: [[https://ggus.eu/index.php?mode=ticket_info&ticket_id=144342]]<br />SUBJECT: NGI_CH - November 2019 - RP/RC OLA performance <br /><br /> <div id="1575540900.318500"> <div dir="auto"> <p>Such tickets are a “standard formulation”. We have received many in the past, affecting all sites, because ops probe failures inevitably go undetected when they do not affect the production experiments. In this specific case, it is the first time the site has also been notified of the ticket. In the past, it was just assigned to NGI_CH, so only I would receive the notification. I would then investigate with the site and report on the ticket. In some cases, as Dario and Dino might remember, we never found the cause of errors that appeared and went away on their own.</p> </div> </div> <div id="1575541045.320300"> <p>[[https://cscs-lcg.slack.com/archives/C1H1XBS14/p1575541045320300][For this specific ticket]], I had spotted the error (by pure chance) and reported it to Dario; it was corrected in less than 24 hours, on 22 Nov. In fact, this link shows that the availability goes back to 100% by the end of that day: https://egi.ui.argo.grnet.gr/egi/report-ar-group/Critical/2019-11/SITES/CSCS-LCG2</p> </div> <br />During that period I also see issues affecting the ARC CEs, but these went away spontaneously, and it is no longer easy to investigate what happened at the time of the failures.<br /><br />To mitigate this in the future, we have mentioned before that notifications can be turned on at the site/service level in GOCDB. These will trigger an email to the GOCDB site contact when ops probes fail. Each site should choose its own matrix of notifications.
There are two independent levels: site level (turned on by editing the main site page) and service level (turned on by editing each service page).<br /><br /><br />
   * NGI-CH Open Tickets review

---++ Other topics

Next meeting date: Jan 09, 2020 at 14:00 Zurich time. Same Zoom connection details.

---++ A.O.B.

   * Mauro points out that ATLAS has not been running much recently. Nick informs him that this is due to fair-share catching up, because LHCb did not run in the last weeks. This could potentially be a problem given how ATLAS workflows operate, which penalise sites that show peaks. Mauro wants to know whether QoS can be set so that VOs always get a minimum and maximum share of the available resources. The answer is that it can, but CSCS cannot be held accountable for the number of CPU hours lost if there are no jobs in the queue. A possible alternative would be to tune priorities.
   * Vinzenz: Dino and Dario are out on vacation. We suggest writing to [[mailto:grid@cscs.ch][grid@cscs.ch]] rather than to individuals. Some CMS people seem to have problems accessing files at the T2, so Vinzenz will open a ticket via [[mailto:grid-rt@cscs.ch][grid-rt@cscs.ch]] so that CSCS can follow up.

---++ Attendants

   * CSCS: Nicholas Cardo, Miguel Gila, Gianni Ricciardi
   * CMS: Derek Feichtinger, Vinzenz Stampf
   * ATLAS:
   * LHCb:
   * EGI:
   * CHIPP: Christoph Grab, Mauro Donega
Topic attachments
| *Attachment* | *History* | *Size* | *Date* | *Who* | *Comment* |
| CHIPPreportNov2019.pdf | r1 | 879.5 K | 2019-12-05 - 13:01 | NickCardo | CSCS November Report |
Topic revision: r6 - 2019-12-12 - MiguelGila