
Swiss Grid Operations Meeting on 2015-06-10

Site status

CSCS

  • Scheduled Downtime 20.05.2015
    • DCS3700 storage controllers firmware updated
  • Scheduled Downtime 08.06.2015
    • dCache updated from minor version 2.6.45 to 2.6.50 to resolve some issues noticed in the logs
    • Restarting the dCache services also helped to reduce the load peaks on the WNs, since many transfers were hanging
    • Changed the pool selection mechanism in preparation for dCache 2.10
    • Updated tzdata, openssl and other base packages on dCache servers.
    • Increased the following parameters of the configuration:
      gsiftpMaxStreamsPerClient=20 gsiftpMaxLogin=200
    • Increased the number of open WAN transfers (in/out) on all pools from 8 to 16 to cope with the new load.
  • ARC Configuration
    • arc01 finally working (ATLAS jobs) and passing (almost) all checks on NGI Nagios; some minor tuning may still be necessary and the configuration via Puppet remains to be completed
  • Other Operations
  • Attended the Pre-GDB meeting on 12.05.2015 at CERN (GR)
    • quite interesting, dedicated to the ongoing efforts of Grid sites migrating away from LRMSs that are no longer supported
    • most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
    • several issues related to HTCondor discussed:
      • multicore jobs: the configuration is not trivial regarding prioritization, fair-share and back-filling
      • BDII issues related to HTCondor (the out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
    • cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
    • ARC accounting issues (Jura) reported and discussed: interesting slides by John Gordon
    • alternatives to CREAM CEs discussed as well (ARC, Condor CE):
      • most sites are interested in adopting ARC CEs
      • the future of CREAM is quite uncertain and several sites are already decommissioning their CREAM CEs; several people hope that WLCG will release official notes on which CEs will be supported and which are recommended for new sites
      • the main issue with ARC is accounting: it works for the most part, but several issues are still open (re-publication, Jura glitches, etc.)
      • the ARC developers are aware of this and are looking for solutions (publishing via the APEL client? extracting data directly from the LRMS's DB?)
      • HTCondor CE discussed as well:
        • several sites are interested since it obviously integrates quite well with HTCondor
        • open issues: its integration with BDII and APEL is far from done
        • it is now packaged by OSG, and all US T1s and T2s run it
        • should WLCG re-package it?
    • ARC support for the four main WLCG VOs seems quite viable and is already implemented at several sites
    • ARC workshop this Autumn: one day should be dedicated to setting up and configuring an ARC CE
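The dCache tuning applied during the 08.06 downtime could look roughly like the sketch below. The two property names match the dCache 2.6 series as quoted above; the pool admin-shell commands for the WAN mover limit are an assumption to be verified against the local pool setup files:

```shell
# /etc/dcache/dcache.conf on the GridFTP doors (2.6-series property names)
gsiftpMaxStreamsPerClient=20
gsiftpMaxLogin=200

# Per-pool mover limit for WAN transfers, raised from 8 to 16.
# Admin-shell syntax is an assumption; check against the local setup:
#   (local) admin > \c poolName
#   (poolName) admin > mover set max active -queue=wan 16
#   (poolName) admin > save
```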

PSI

  • Mainly preparing for the summer leave
    • No further developments in this stage
    • Reviewing and updating the full T3 doc
    • Improving the Puppet recipes
    • Enhancing the Nagios checks
    • Making the handover to Derek
  • dCache / PhEDEX
    • We managed to clean up 350 TB of old user/CMS data at PSI; contacting the users took a lot of time.
    • CSCS, be aware that in 2.10, changing the GID of a file resets its icrtime field to '01-01-1970'! The dCache team acknowledged this bug:
      "Thanks for your input. We know about this behavior in 2.10 and 2.11. It is fixed from 2.12 on."
    • My proposal to N. Magini to explicitly set delegate=true in the PhEDEx configurations has been accepted
  • NetApp E5400
    • Got this warning:
      Node ID: T3_CMS_E5460_01 Event Error Code: 2836 Event occurred: May 29, 2015 10:36:03 AM Event Message: Discrete lines diagnostic failure Event Priority: Critical Component Type: Battery Pack
    • That led to ticket 2005687906 and in turn to the NetApp proposal to replace RAID controller B; the replacement succeeded
    • In the end this was a good experience: users noticed nothing, there were no downtimes, and the Linux RDAC multipath driver worked nicely.
    • NetApp support is based in China nowadays; the timezone difference does not speed up communications.
  • Son of Grid Engine 8.1.8
    • No time to make further progress here, regrettably.

UNIBE-LHEP

  • Operations
    • Still at about half capacity
    • Aircon issues on 21st May caused a loss of power on more than half of the WNs. Power was partly restored, but many nodes crashed straight after power-up/re-install and are still down
    • Lustre was unstable on ce02 following power-up of the dead nodes and had to be reformatted from scratch. The MDS also needed power-cycling ( sd 6:2:0:0: rejecting I/O to offline device )
    • Issues since ARC 5:
      • a-rex crashes are still happening (caught by cron). The NorduGrid developers are aware, but there has been no progress so far.
      • A bug in a-rex triggers massive verbosity in grid-manager.log, filling up the /var partition. It has hit us 6 times across both CEs. It does not occur with "debug=0" (no logging at all, which is not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies in a satisfactory way is tricky (no success so far)
      • A bug (?) in a-rex causes the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: crons to restart a-rex, then nordugrid-arc-ldap-infosys
      • The controldir was stuffed with over 100k zero-length (0k) files and as many directories in "joblinks". A thorough cleanup took a full night. This caused a-rex instabilities and eventually prevented it from starting at all. Tons of <defunct> processes, likely due to the restart crons
      • Obscure corruption in the control dir (prevents a-rex from starting). Cure: rm -rf /var/spool/nordugrid/jobstatus/gm.fifo
  • ATLAS specific operations
    • gridengine refuses to run multicore jobs (not 100% of the time)
      • cron to run qalter -R y on the mcore jobs in the queue
      • allow 20 reservations: # qconf -ssconf | grep reservation
        max_reservation 20
  • Ongoing work
    • prototyping ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
    • 320 old SunBlade cores delivered from Ubelix, awaiting rackmounting, installation
    • 6 IBM servers from CSCS to be picked up
    • A temperature monitor in the room is desirable; looking at self-made solutions
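The workaround crons mentioned above could be collected in a crontab along the lines of the following sketch; the timings, the queue name mcore.q and the use of service(8) are assumptions:

```shell
# /etc/cron.d/arc-workarounds (sketch)
# Restart a-rex and then the LDAP infosys when the infoprovider stops
# updating the BDII (frequency is an assumption):
0 */6 * * * root  service a-rex restart && sleep 60 && service nordugrid-arc-ldap-infosys restart

# Force resource reservation on pending multicore jobs that gridengine
# refuses to schedule (queue name is a placeholder):
*/10 * * * * sgeadmin  for j in $(qstat -s p -q mcore.q | awk 'NR>2 {print $1}'); do qalter -R y "$j"; done
```

The matching scheduler setting is the one shown above: max_reservation 20 in the output of qconf -ssconf, changeable with qconf -msconf.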

UNIBE-ID

  • Operations
    • smooth operation, except see below
    • decommissioned and dumped nearly all old Sun hardware
      • all Sun Blade chassis and blades moved to Gianfranco
    • further progress in moving to a Puppet-managed environment, very promising
  • ATLAS specific operations
    • problems with grid-mapfile for atlas:
      • fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
      • this stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
    • consequence: no jobs have been allowed since last Saturday, after all DNs were finally dropped
    • Investigations on operator side currently in progress
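Assuming the grid-mapfile is generated with edg-mkgridmap, the two vomss endpoints above would appear in its configuration roughly as follows (the local account mappings are placeholders):

```shell
# /opt/edg/etc/edg-mkgridmap.conf (path may differ)
# This query still works:
group vomss://voms2.cern.ch:8443/voms/atlas                        .atlas
# This query stopped working last week (returns no DNs):
group vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production .atlasprod
```

When the Role=production query returns an empty list, the corresponding DNs eventually drop out of the generated grid-mapfile, which matches the authorization failures seen since Saturday.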

UNIGE

  • Xxx

NGI_CH

  • Update on certificates:
    • 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
    • User cert OK; the host cert will be verified in the next few days

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
  • CMS: Daniel Meister, Fabio Martinelli
  • ATLAS: Gianfranco Sciacca
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1