<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable by internal people only
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-06-10

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * *Scheduled downtime 20.05.2015*
      * DCS3700 storage controller firmware updated
   * *Scheduled downtime 08.06.2015*
      * dCache updated to a newer minor version, from =2.6.45= to =2.6.50=, to solve some issues noticed in the logs
      * Restarting the dCache services also helped with the load peaks on the WNs, since many transfers were hanging
      * Changed the pool selection mechanism in preparation for dCache 2.10
      * Updated tzdata, openssl and other base packages on the dCache servers
      * Increased the following configuration parameters: <pre>gsiftpMaxStreamsPerClient=20
gsiftpMaxLogin=200</pre>
      * Increased the number of open transfers from/to WAN on all pools from 8 to 16 to cope with the new load
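The two gsiftp parameters mentioned above are plain dCache configuration properties; on a 2.6-series install they would typically be set centrally, roughly as follows (the file path is the usual default, not taken from the CSCS setup):
<pre># /etc/dcache/dcache.conf -- sketch; values as reported in the downtime notes
# allow up to 20 parallel GridFTP streams per client connection
gsiftpMaxStreamsPerClient=20
# allow up to 200 concurrent logins on the GridFTP doors
gsiftpMaxLogin=200</pre>
The doors need to be restarted for the new values to take effect.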
   * *ARC configuration*
      * =arc01= eventually working (ATLAS jobs) and passing (almost) all checks on the NGI Nagios; some minor tuning may still be necessary, and the configuration via Puppet is still to be completed
   * *Other operations*
      * *Attended the [[https://indico.cern.ch/event/319821/][pre-GDB meeting on 12.05.2015 at CERN]]* (GR)
         * Quite interesting, and dedicated to the ongoing efforts of Grid sites migrating away from LRMSs that are no longer supported
         * Most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
         * Several issues related to HTCondor were discussed:
            * Multicore jobs: the configuration is not trivial regarding prioritization, fair share and backfilling
            * BDII issues related to HTCondor (the out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
            * cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
            * ARC accounting issues (JURA) reported and discussed: interesting [[https://indico.cern.ch/event/319821/session/0/material/0/5.pdf][slides by John Gordon]]
         * Alternatives to CREAM CEs were discussed as well (ARC CE, HTCondor CE):
            * Most sites are interested in adopting ARC CEs
            * The future of CREAM is quite uncertain and several sites are already dismissing their CREAM CEs; several people hope that WLCG will release official notes about which CEs will be supported and which ones are recommended to new sites
            * The main issue with ARC is accounting: working for the most part, but several issues are still open (re-publication, JURA glitches, etc.)
            * The ARC developers are aware of this and are looking for solutions (publishing via the APEL client? extracting data directly from the LRMS's DB?)
         * The HTCondor CE was discussed as well:
            * Several sites are interested, since it obviously integrates quite well with HTCondor
            * Open issues: its integration with BDII and APEL is far from done
            * It is now packaged by OSG, and all US T1 and T2 sites run it
            * Should WLCG re-package it?
         * ARC support for the four main WLCG VOs seems to be quite viable and is already implemented by several sites
         * ARC workshop this autumn: one day should be dedicated to setting up and configuring an ARC CE

---+++ PSI
   * *Mainly preparing for the summer leave*
   * *No further developments at this stage*
      * Reviewing and updating the full T3 documentation
      * Improving the Puppet recipes
      * Enhancing the Nagios checks
      * Making the handover to Derek
   * dCache / PhEDEx
      * We managed to clean *350TB* of old user/CMS data at PSI. It took me a lot of time to contact the users.
      * %RED%CSCS%ENDCOLOR%, be aware that in 2.10, if you change the GID of a file, its =icrtime= field gets reset to '01-01-1970'! The dCache team acknowledged this bug: <pre>thanks for your input. We know about this behavior in 2.10 and 2.11. It is fixed from 2.12 on.</pre>
      * My proposal to N. Magini to explicitly report delegate=true in the PhEDEx configurations has been [[https://github.com/dmwm/PHEDEX/commit/35d0ee3d1cc376a00e9b39b033dc7641aa16b2c1][accepted]]
   * NetApp E5400
      * Got this warning: <pre>Node ID: T3_CMS_E5460_01
Event Error Code: 2836
Event occurred: May 29, 2015 10:36:03 AM
Event Message: Discrete lines diagnostic failure
Event Priority: Critical
Component Type: Battery Pack</pre>
      * That led to ticket [[http://mysupport.netapp.com/portal?_nfpb=true&_st=&_pageLabel=caseDetailsPage&initialPage=true&caseNumber=2005687906][2005687906]] and in turn to the NetApp proposal to replace RAID controller B; the replacement succeeded
      * At the end of the day this was a nice experience: users noticed nothing, there were no downtimes, and the RDAC multipath driver in Linux worked nicely
      * NetApp support is in China nowadays; the timezone difference doesn't speed up the communications
   * [[https://arc.liv.ac.uk/trac/SGE][Son of Grid Engine 8.1.8]]
      * No time to make further progress here, regrettably
---+++ UNIBE-LHEP
   * *Operations*
      * Still at about half capacity
      * Aircon issues on 21st May caused loss of power on more than half of the WNs. Partly restored, but many nodes crashed straight away after power-up/re-install and are still down
      * Lustre unstable on ce02 following power-up of the dead nodes; had to reformat it from scratch. Also needed to power-cycle the MDS ( =sd 6:2:0:0: rejecting I/O to offline device= )
      * Issues since ARC 5:
         * a-rex crashes still happening (caught by cron). The NorduGrid developers are aware, but no progress on this so far
         * Bug in a-rex triggering massive verbosity of the grid-manager.log, causing the /var partition to fill up. Hit us 6 times on both CEs. It doesn't occur when setting "debug=0" (no logging at all, which is not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies in a satisfactory way is tricky (no success so far)
         * Bug (?) in a-rex causing the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: crons to restart a-rex and then nordugrid-arc-ldap-infosys
         * Control dir stuffed with over 100k zero-size files and as many directories in "joblinks". A thorough cleanup took a full night. This causes a-rex instabilities and eventually prevented it from starting at all. Tons of =<defunct>= processes, likely due to the restart crons
         * Obscure corruption in the control dir (prevents a-rex from starting). Cure: =rm -rf /var/spool/nordugrid/jobstatus/gm.fifo=
   * *ATLAS specific operations*
      * gridengine refuses to run multicore jobs (not 100% of the time)
      * Cron to =qalter -R y= the multicore jobs in the queue
      * Allow 20 reservations: =# qconf -ssconf | grep reservation= shows =max_reservation 20=
   * *Ongoing work*
      * Prototyping a ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
      * 320 old SunBlade cores delivered from Ubelix, awaiting rack-mounting and installation
      * 6 IBM servers from CSCS to be picked up
      * A temperature monitor in the room is desirable; looking at self-made solutions

---+++ UNIBE-ID
   * *Operations*
      * Smooth operation, except for the issue below
      * Decommissioned and dumped nearly all the old Sun hardware
      * All Sun Blade chassis and blades moved to Gianfranco
      * Further progress in moving to a Puppet-managed environment, very promising
   * *ATLAS specific operations*
      * Problems with the grid-mapfile for ATLAS:
         * Fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
         * This stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
         * Consequence: no jobs have been allowed since last Saturday, after all DNs were finally dropped
         * Investigations on the operator side are currently in progress

---+++ UNIGE
   * Xxx

---+++ NGI_CH
   * Update on certificates:
      * 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
      * User cert OK, will verify the host cert in the next few days

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
   * CMS: Daniel Meister, Fabio Martinelli
   * ATLAS: Gianfranco Sciacca
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1
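The =qalter -R y= workaround for multicore jobs mentioned under UNIBE-LHEP is typically driven by a small cron job. A minimal sketch, with an illustrative job selection (the site's actual cron presumably filters for multicore jobs only; the file name is hypothetical):
<pre># /etc/cron.d/sge-mcore-reservation -- sketch, not the site's actual cron
# Every 5 minutes, turn on resource reservation for all pending jobs;
# qstat -s p lists pending jobs, awk skips the two header lines.
*/5 * * * * root for j in $(qstat -s p | awk 'NR>2 {print $1}'); do qalter -R y "$j"; done</pre>
This only has an effect because =max_reservation= was raised to 20 in the scheduler configuration, as noted above; with the default of 0 the scheduler ignores reservation requests.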
Topic revision: r15 - 2015-06-10 - RolandBernet