<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable only by internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->

---+ Swiss Grid Operations Meeting on 2015-06-10

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * *Scheduled downtime 20.05.2015*
      * DCS3700 storage controller firmware updated
   * *Scheduled downtime 08.06.2015*
      * dCache updated to a new minor version, from =2.6.45= to =2.6.50=, to solve some issues noticed in the logs
         * restarting the dCache services also helped with the load peaks on the WNs, since lots of transfers had been hanging
      * Changed the pool selection mechanism in preparation for dCache 2.10
      * Updated tzdata, openssl and other base packages on the dCache servers
      * Increased the following configuration parameters: <pre>gsiftpMaxStreamsPerClient=20
gsiftpMaxLogin=200</pre>
      * Increased the number of transfers open from/to the WAN on all pools from 8 to 16 to cope with the new load
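A change like the one above can be applied idempotently rather than by hand-editing. This is a minimal sketch assuming a flat key=value =dcache.conf=; the config path and restart command are assumptions, not the actual CSCS setup:

```shell
# set_param: ensure key=value is present in a config file,
# updating the line if the key exists, appending it otherwise.
set_param() {
    conf="$1"; key="$2"; val="$3"
    if grep -q "^${key}=" "$conf" 2>/dev/null; then
        # key already present: rewrite the existing line
        sed -i "s/^${key}=.*/${key}=${val}/" "$conf"
    else
        echo "${key}=${val}" >> "$conf"
    fi
}

# applying the values from the downtime notes (path is an assumption):
# set_param /etc/dcache/dcache.conf gsiftpMaxStreamsPerClient 20
# set_param /etc/dcache/dcache.conf gsiftpMaxLogin 200
# the doors then need a restart for the change to take effect
```

Keeping this in a script (or a Puppet resource, as used elsewhere at the site) avoids drift between repeated manual edits.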
   * *ARC configuration*
      * =arc01= eventually working (ATLAS jobs) and passing (almost) all checks on the NGI Nagios; some minor tuning may still be necessary, and the configuration via Puppet is still to be completed
   * *Other operations*
      * *Attended the [[https://indico.cern.ch/event/319821/][Pre-GDB meeting on 12.05.2015 at CERN]]* (GR)
         * quite interesting and dedicated to the ongoing efforts of Grid sites migrating away from LRMSs that are no longer supported
         * most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
         * several issues related to HTCondor were discussed:
            * multicore jobs: the configuration is not trivial regarding prioritization, fair-share and back-filling
            * BDII issues related to HTCondor (the out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
            * cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
            * ARC accounting issues (Jura) reported and discussed: interesting [[https://indico.cern.ch/event/319821/session/0/material/0/5.pdf][slides by John Gordon]]
         * alternatives to CREAM CEs were discussed as well (ARC CE, HTCondor CE):
            * most sites are interested in adopting ARC CEs
            * the future of CREAM is quite uncertain and several sites are already dismissing their CREAM CEs; several people hope that WLCG will release official notes on which CEs will be supported and which ones are recommended to new sites
            * the main issue with ARC is accounting: it works for the most part, but several issues are still open (re-publication, Jura glitches, etc.)
               * the ARC developers are aware of this and are looking for solutions (publishing via the APEL client? extracting data directly from the LRMS's DB?)
            * the HTCondor CE was discussed as well:
               * several sites are interested, since it obviously integrates quite well with HTCondor
               * open issues: its integration with BDII and APEL is far from done
               * it is now packaged by OSG, and all US T1 and T2 sites run it; should WLCG re-package it?
            * ARC support for the four main WLCG VOs seems to be quite viable and is already implemented at several sites
         * ARC workshop this autumn: one day should be dedicated to setting up and configuring an ARC CE

---+++ PSI
   * *Mainly preparing for the summer leave*
   * *No further developments at this stage*
      * reviewing and updating the full T3 documentation
      * improving the Puppet recipes
      * enhancing the Nagios checks
      * making the handover to Derek
   * dCache / PhEDEx
      * We managed to clean up *350TB* of old user/CMS data at PSI. Contacting the users took me a lot of time.
      * %RED%CSCS%ENDCOLOR%, be aware that in 2.10, if you change the GID of a file, its =icrtime= field gets reset to '01-01-1970'! The dCache team acknowledged this bug: <pre>thanks for your input. We know about this behavior in 2.10 and 2.11. It is fixed from 2.12 on.</pre>
      * My proposal to N. Magini to explicitly report delegate=true in the PhEDEx configurations has been [[https://github.com/dmwm/PHEDEX/commit/35d0ee3d1cc376a00e9b39b033dc7641aa16b2c1][accepted]]
   * NetApp E5400
      * Got this warning: <pre>Node ID: T3_CMS_E5460_01
Event Error Code: 2836
Event occurred: May 29, 2015 10:36:03 AM
Event Message: Discrete lines diagnostic failure
Event Priority: Critical
Component Type: Battery Pack</pre>
      * That led to ticket [[http://mysupport.netapp.com/portal?_nfpb=true&_st=&_pageLabel=caseDetailsPage&initialPage=true&caseNumber=2005687906][2005687906]] and in turn to the NetApp proposal to replace RAID controller B; the replacement succeeded
      * In the end this was a nice experience: users noticed nothing, there were no downtimes, and the RDAC multipath driver in Linux worked nicely
      * NetApp support is in China nowadays; the timezone difference doesn't speed up the communication
   * [[https://arc.liv.ac.uk/trac/SGE][Son of Grid Engine 8.1.8]]
      * No time to make further progress here, regrettably
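A cleanup campaign like the 350TB one above starts with finding out how much stale data each user holds. This is a sketch of one way to tally it; the per-user directory layout and the age threshold are illustrative assumptions, not the actual PSI storage layout:

```shell
# old_data_report: for each per-user directory under $1, print the
# total bytes in files not modified within the last $2 days.
# Useful for ranking cleanup candidates before contacting the owners.
old_data_report() {
    base="$1"; days="$2"
    for d in "$base"/*/; do
        # sum the sizes of files older than the threshold (GNU find)
        bytes=$(find "$d" -type f -mtime +"$days" -printf '%s\n' |
                awk '{s+=$1} END {print s+0}')
        printf '%s %s\n' "${d#"$base"/}" "$bytes"
    done
}

# example: old_data_report /pnfs/psi.ch/cms/trivcat/store/user 365
```

Sorting the output by the second column gives the contact list in priority order.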
---+++ UNIBE-LHEP
   * *Operations*
      * Still at about half capacity
      * Aircon issues on 21st May caused a loss of power on more than half of the WNs. Partly restored, but many nodes crashed straight away after power-up/re-install and are still down
      * Lustre unstable on ce02 following the power-up of the dead nodes; had to reformat it from scratch. Also needed to power-cycle the MDS ( =sd 6:2:0:0: rejecting I/O to offline device= )
      * Issues since ARC 5:
         * a-rex crashes still happening (caught by cron). The NorduGrid developers are aware, but no progress on this so far
         * Bug in a-rex triggering massive verbosity in grid-manager.log, causing the /var partition to fill up. Hit us 6 times on both CEs. It doesn't occur when setting "debug=0" (no logging, not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies satisfactorily is tricky (no success so far)
         * Bug (?) in a-rex causing the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: crons to restart a-rex, then nordugrid-arc-ldap-infosys
         * controldir stuffed with over 100k zero-size files and as many directories in "joblinks". A thorough cleanup took a full night. This caused a-rex instabilities and eventually prevented it from starting at all. Tons of =<defunct>= processes, likely due to the restart crons
         * Obscure corruption in the control dir (prevents a-rex from starting). Cure: =rm -rf /var/spool/nordugrid/jobstatus/gm.fifo=
   * *ATLAS specific operations*
      * gridengine refuses to run multicore jobs (not 100% of them)
         * cron to =qalter -R y= the mcore jobs in the queue
         * allow 20 reservations: =qconf -ssconf | grep reservation= now shows =max_reservation 20=
   * *Ongoing work*
      * prototyping a ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
      * 320 old SunBlade cores delivered from Ubelix, awaiting rack-mounting and installation
      * 6 IBM servers from CSCS to be picked up
      * a temperature monitor in the room is desirable; looking at self-made solutions

---+++ UNIBE-ID
   * *Operations*
      * smooth operation, except for the issue below
      * decommissioned and dumped nearly all old Sun hardware
      * all Sun Blade chassis and blades moved to Gianfranco
      * further progress in moving to a Puppet-managed environment, very promising
   * *ATLAS specific operations*
      * problems with the grid-mapfile for ATLAS:
         * fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
         * this stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
         * consequence: no jobs have been allowed since last Saturday, after all DNs had finally been dropped
         * investigations on the operator side currently in progress

---+++ UNIGE
   * Xxx

---+++ NGI_CH
   * Update on certificates:
      * 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
      * user cert OK; will verify the host cert in the next few days

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
   * CMS: Daniel Meister, Fabio Martinelli
   * ATLAS: Gianfranco Sciacca
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1
Topic revision: r15 - 2015-06-10 - RolandBernet