<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.EgiGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page to be viewable by internal people only
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup,Main.ChippComputingBoardGroup
-->
---+ Swiss Grid Operations Meeting on 2015-06-10

   * *Date and time*: first Thursday of the month, at 14:00
   * *Place*: Vidyo (room: Swiss_Grid_Operations_Meeting, extension: 109305236)
   * *External link*: http://vidyoportal.cern.ch/flex.html?roomdirect.html&key=gDf6l4RlIAGN
   * *Phone gate*: from Switzerland: 0227671400 (portal) + 109305236 (extension) + # (pound sign)
   * *IRC chat*: irc:gridchat.cscs.ch:994#lcg (ask for the password via email)

%TOC%

---++ Site status

---+++ CSCS
   * *Scheduled downtime 20.05.2015*
      * DCS3700 storage controller firmware updated
   * *Scheduled downtime 08.06.2015*
      * dCache updated to a newer minor version, from =2.6.45= to =2.6.50=, to solve some issues noticed in the logs
      * Restarting the dCache services also helped with the load peaks on the WNs, since many transfers were hanging
      * Changed the pool selection mechanism in preparation for dCache 2.10
      * Updated tzdata, openssl and other base packages on the dCache servers
      * Increased the following configuration parameters: <pre>gsiftpMaxStreamsPerClient=20
gsiftpMaxLogin=200</pre>
      * Increased the number of open transfers from/to WAN on all pools from 8 to 16 to cope with the new load
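The two gsiftp parameters mentioned above are plain dCache configuration properties; on a 2.6-series install they would typically be set centrally, roughly as follows (the file path is the usual default, not taken from the CSCS setup):
<pre># /etc/dcache/dcache.conf -- sketch; values as reported in the downtime notes
# allow up to 20 parallel GridFTP streams per client connection
gsiftpMaxStreamsPerClient=20
# allow up to 200 concurrent logins on the GridFTP doors
gsiftpMaxLogin=200</pre>
The doors need to be restarted for the new values to take effect.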
   * *ARC configuration*
      * =arc01= eventually working (ATLAS jobs) and passing (almost) all checks on the NGI Nagios; some minor tuning may still be necessary, and the configuration via Puppet is still to be completed
   * *Other operations*
      * *Attended the [[https://indico.cern.ch/event/319821/][pre-GDB meeting on 12.05.2015 at CERN]]* (GR)
         * Quite interesting, and dedicated to the ongoing efforts of Grid sites migrating away from LRMSs that are no longer supported
         * Most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
         * Several issues related to HTCondor were discussed:
            * Multicore jobs: the configuration is not trivial regarding prioritization, fair share and backfilling
            * BDII issues related to HTCondor (the out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
            * cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
            * ARC accounting issues (JURA) reported and discussed: interesting [[https://indico.cern.ch/event/319821/session/0/material/0/5.pdf][slides by John Gordon]]
         * Alternatives to CREAM CEs were discussed as well (ARC CE, HTCondor CE):
            * Most sites are interested in adopting ARC CEs
            * The future of CREAM is quite uncertain and several sites are already dismissing their CREAM CEs; several people hope that WLCG will release official notes about which CEs will be supported and which ones are recommended to new sites
            * The main issue with ARC is accounting: working for the most part, but several issues are still open (re-publication, JURA glitches, etc.)
            * The ARC developers are aware of this and are looking for solutions (publishing via the APEL client? extracting data directly from the LRMS's DB?)
         * The HTCondor CE was discussed as well:
            * Several sites are interested, since it obviously integrates quite well with HTCondor
            * Open issues: its integration with BDII and APEL is far from done
            * It is now packaged by OSG, and all US T1 and T2 sites run it
            * Should WLCG re-package it?
         * ARC support for the four main WLCG VOs seems to be quite viable and is already implemented by several sites
         * ARC workshop this autumn: one day should be dedicated to setting up and configuring an ARC CE

---+++ PSI
   * *Mainly preparing for the summer leave*
   * *No further developments at this stage*
      * Reviewing and updating the full T3 documentation
      * Improving the Puppet recipes
      * Enhancing the Nagios checks
      * Making the handover to Derek
   * dCache / PhEDEx
      * We managed to clean *350TB* of old user/CMS data at PSI. It took me a lot of time to contact the users.
      * %RED%CSCS%ENDCOLOR%, be aware that in 2.10, if you change the GID of a file, its =icrtime= field gets reset to '01-01-1970'! The dCache team acknowledged this bug: <pre>thanks for your input. We know about this behavior in 2.10 and 2.11. It is fixed from 2.12 on.</pre>
      * My proposal to N. Magini to explicitly report delegate=true in the PhEDEx configurations has been [[https://github.com/dmwm/PHEDEX/commit/35d0ee3d1cc376a00e9b39b033dc7641aa16b2c1][accepted]]
   * NetApp E5400
      * Got this warning: <pre>Node ID: T3_CMS_E5460_01
Event Error Code: 2836
Event occurred: May 29, 2015 10:36:03 AM
Event Message: Discrete lines diagnostic failure
Event Priority: Critical
Component Type: Battery Pack</pre>
      * That led to ticket [[http://mysupport.netapp.com/portal?_nfpb=true&_st=&_pageLabel=caseDetailsPage&initialPage=true&caseNumber=2005687906][2005687906]] and in turn to the NetApp proposal to replace RAID controller B; the replacement succeeded
      * At the end of the day this was a nice experience: users noticed nothing, there were no downtimes, and the RDAC multipath driver in Linux worked nicely
      * NetApp support is in China nowadays; the timezone difference doesn't speed up the communications
   * [[https://arc.liv.ac.uk/trac/SGE][Son of Grid Engine 8.1.8]]
      * No time to make further progress here, regrettably
---+++ UNIBE-LHEP
   * *Operations*
      * Still at about half capacity
      * Aircon issues on 21st May caused loss of power on more than half of the WNs. Partly restored, but many nodes crashed straight away after power-up/re-install and are still down
      * Lustre unstable on ce02 following power-up of the dead nodes; had to reformat it from scratch. Also needed to power-cycle the MDS ( =sd 6:2:0:0: rejecting I/O to offline device= )
      * Issues since ARC 5:
         * a-rex crashes still happening (caught by cron). The NorduGrid developers are aware, but no progress on this so far
         * Bug in a-rex triggering massive verbosity of the grid-manager.log, causing the /var partition to fill up. Hit us 6 times on both CEs. It doesn't occur when setting "debug=0" (no logging at all, which is not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies in a satisfactory way is tricky (no success so far)
         * Bug (?) in a-rex causing the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: crons to restart a-rex and then nordugrid-arc-ldap-infosys
         * Control dir stuffed with over 100k zero-size files and as many directories in "joblinks". A thorough cleanup took a full night. This causes a-rex instabilities and eventually prevented it from starting at all. Tons of =<defunct>= processes, likely due to the restart crons
         * Obscure corruption in the control dir (prevents a-rex from starting). Cure: =rm -rf /var/spool/nordugrid/jobstatus/gm.fifo=
   * *ATLAS specific operations*
      * gridengine refuses to run multicore jobs (not 100% of the time)
      * Cron to =qalter -R y= the multicore jobs in the queue
      * Allow 20 reservations: =# qconf -ssconf | grep reservation= shows =max_reservation 20=
   * *Ongoing work*
      * Prototyping a ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
      * 320 old SunBlade cores delivered from Ubelix, awaiting rack-mounting and installation
      * 6 IBM servers from CSCS to be picked up
      * A temperature monitor in the room is desirable; looking at self-made solutions

---+++ UNIBE-ID
   * *Operations*
      * Smooth operation, except for the issue below
      * Decommissioned and dumped nearly all the old Sun hardware
      * All Sun Blade chassis and blades moved to Gianfranco
      * Further progress in moving to a Puppet-managed environment, very promising
   * *ATLAS specific operations*
      * Problems with the grid-mapfile for ATLAS:
         * Fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
         * This stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
         * Consequence: no jobs have been allowed since last Saturday, after all DNs were finally dropped
         * Investigations on the operator side are currently in progress

---+++ UNIGE
   * Xxx

---+++ NGI_CH
   * Update on certificates:
      * 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
      * User cert OK, will verify the host cert in the next few days

---++ Other topics
   * Topic1
   * Topic2

Next meeting date:

---++ A.O.B.

---++ Attendants
   * CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
   * CMS: Daniel Meister, Fabio Martinelli
   * ATLAS: Gianfranco Sciacca
   * LHCb: Roland Bernet
   * EGI: Gianfranco Sciacca

---++ Action items
   * Item1
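The =qalter -R y= workaround for multicore jobs mentioned under UNIBE-LHEP is typically driven by a small cron job. A minimal sketch, with an illustrative job selection (the site's actual cron presumably filters for multicore jobs only; the file name is hypothetical):
<pre># /etc/cron.d/sge-mcore-reservation -- sketch, not the site's actual cron
# Every 5 minutes, turn on resource reservation for all pending jobs;
# qstat -s p lists pending jobs, awk skips the two header lines.
*/5 * * * * root for j in $(qstat -s p | awk 'NR>2 {print $1}'); do qalter -R y "$j"; done</pre>
This only has an effect because =max_reservation= was raised to 20 in the scheduler configuration, as noted above; with the default of 0 the scheduler ignores reservation requests.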
Topic revision: r15 - 2015-06-10 - RolandBernet