<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2015-07-02

   * *Date and time*: First Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: From Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask pw via email)

%TOC%

---++ Site status

---+++ CSCS
   * Operations:
      * dCache overall status
      * CMS !PhEDEx reinstallation status
   * Tickets:
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=405][405 | CMS | T2_CH_CSCS Phedex agents down]]:
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=402][402 | CMS | T2_CH_CSCS with CE critical for > 13 hours]]:
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=398][398 | CMS | space monitoring at T2_CH_CSCS]]: lower priority
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=397][397 | CMS | T2_CH_CSCS - links]]:
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=403][403 | LHCb | CPU efficiency at CSCS-LCG2]]: Difficult to identify what's going on as the output from the job cannot be obtained.
      * [[https://xgus.ggus.eu/ngi_ch/index.php?mode=ticket_info&ticket_id=388][388 | none | Missing Accounting Date for April 2015]]: Linked to internal WebRT #19446 and #19946
   * VO specific tickets:

---+++ PSI
   * Fabio will be on leave until 6th July

---+++ UNIBE-LHEP
   * *Operations*
      * Still bumpy and at about half capacity
      * Restored 320 old cores (from Ubelix), but many tend to crash
      * One more aircon issue (22nd June). Many nodes lost power. Working on a temperature monitoring system in the room (some input from PSI too, thanks!)
      * Likely related to the aircon problem: Lustre disks on 3 nodes went flaky.
      * More issues with nodes crashing on both clusters. Most of the time, jobs remain in state "dr" in gridengine. These somehow prevent new jobs from being submitted (the new jobs remain in PREPARING state in ARC). Now added a cron job to clean these up and log the node names.
      * LAN down on ce01 on Friday 26th June. Very likely a hardware failure, but in the rush to bring the cluster back online, we failed to establish whether that was really the case. Recovery: swap to unused network interface, register network and interface changes in ROCKS, redeploy Lustre from scratch, power up and re-install stuck nodes.
   * *ATLAS specific operations*
      * gridengine multicore scheduling improved. Changes to gridengine were already in place a month ago, but success seemed limited. In addition, removed on one cluster a hack that scaled up the requested walltime by 1.4/1.5. The increased difficulty in scheduling multicore jobs is possibly explained by some ATLAS tasks with quite high walltime.
      * All ARC failures still masked by crons. A bugfix release is in apel-testing, however; will try to upgrade soon.
   * *Ongoing work*
      * ROCKS 6.2 just came out, prototyping the cluster deployment chain with this version now (CE, WN, Lustre MDS, Lustre OSS)
      * 6 IBM servers from CSCS collected and rack-mounted. Will be deployed upon re-installation of ce01 (ROCKS 6.2)
      * Temperature monitoring in the server room is in progress.
        New water-cooled rack by Theoretical Physics monitors the inlet water temperature. Add ambient sensors in some racks. Monitor first to learn trends, try to automate in the future (e.g. drain clusters when the inlet water temperature exceeds a threshold)

---+++ UNIBE-ID
   * Michael/Nico cannot attend due to delivery of ESS ;-)
   * *Operations*:
      * smooth, high usage currently
   * *ATLAS-related*:
      * mcore jobs now better scheduled; changes made:
         * resource reservation only set for mcore jobs (within submit-sge-jobs when priority is set)
         * increased max_reservation in scheduler conf from 7 to 32
         * default_duration in scheduler conf now increased from 24h to 97h == h_rt limit of queue where ATLAS jobs are running
      * ATM: WARNING in gridka nagios regarding latest EGI-trustanchors release
         * IGTF-1.65, 0 days old, all present. - SHA Fingerprint failed for ca-policy-lcg. - SHA Fingerprint failed for ca-policy-egi-core
         * Is this a broken release?
   * *UBELIX Puppet Resources*
      * As mentioned at the hpc-forum presentation, we now have a public platform for OSS stuff:
         * http://idos-code.unibe.ch - Stash with most of our puppet modules
         * http://idos-issues.unibe.ch - Jira, our issue tracker for the code above
      * Clone as you like. :-) Contributions (aka pull requests) are welcome as soon as we have our Crowd instance ready (end of week). Please don't register yet, even though it is possible.
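
The reservation change listed under ATLAS-related above (reservation only for mcore jobs, and only when a priority is set) could look roughly like the sketch below. It is illustrative Python, not the actual submission script: the helper name build_qsub_args, the parallel environment name mcore_pe, and the reading of "priority is set" as "a priority value was supplied" are all assumptions. Only the standard Grid Engine qsub options -p, -pe and -R are used.

```python
from typing import List, Optional


def build_qsub_args(slots: int, priority: Optional[int] = None,
                    pe_name: str = "mcore_pe") -> List[str]:
    """Return qsub options for one job (hypothetical helper, not ARC code).

    pe_name is an assumed parallel environment name; each site defines its own.
    """
    args: List[str] = []
    if priority is not None:
        # -p sets the job priority
        args += ["-p", str(priority)]
    if slots > 1:
        # -pe requests a parallel environment with the given number of slots
        args += ["-pe", pe_name, str(slots)]
        if priority is not None:
            # -R y asks the scheduler to reserve resources for this job;
            # in this sketch only multicore jobs with a priority get one
            args += ["-R", "y"]
    return args


if __name__ == "__main__":
    print(build_qsub_args(slots=1, priority=0))  # single-core: no reservation
    print(build_qsub_args(slots=8, priority=0))  # multicore: reservation requested
```

Restricting -R y to mcore jobs matters because max_reservation in the scheduler configuration limits how many jobs can hold a reservation at once, so single-core jobs that request one would use up that budget.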

---+++ UNIGE
   * Still un-manned, likely until 1st October 2015

---+++ NGI_CH
   * Certificates: http://www.lhep.unibe.ch/sits/certificates.html
   * UNIBE-LHEP bad performance April, May 2015: bad SE for ops (May: https://documents.egi.eu/public/ShowDocument?docid=2519)
   * NGI_CH - May 2015 - RP/RC OLA performance: https://ggus.eu/index.php?mode=ticket_info&ticket_id=114449
   * ARC staged rollout?

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.
   * *Reminder:* Face To Face meeting to be held on 21 August 2015 at CSCS.

---++ Attendants
   * CSCS: Miguel Gila
   * CMS:
   * ATLAS: Gianfranco Sciacca
   * LHCb:
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1
Topic revision: r5 - 2015-07-01 - GianfrancoSciacca
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.