Swiss Grid Operations Meeting on 2015-06-10
Site status
CSCS
- Scheduled Downtime 20.05.2015
- DCS3700 storage controllers firmware updated
- Scheduled Downtime 08.06.2015
- ARC Configuration
- arc01: eventually working (ATLAS jobs) and passing (almost) all checks on NGI Nagios; some minor tuning may still be necessary and the configuration via Puppet still needs to be completed
- Other Operations
- Attending Pre-GDB meeting on 12.05.2015 at CERN (GR)
- quite interesting, and dedicated to the ongoing efforts of Grid sites migrating away from LRMSes that are no longer supported
- most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
- several issues related to HTCondor discussed:
- multicore jobs: the configuration is not trivial regarding prioritization, fair-share and back-filling
- BDII issues related to HTCondor (its out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
- cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
- ARC accounting issues (Jura) reported and discussed: interesting slides by John Gordon
- alternatives to CREAM CEs discussed as well (ARC, Condor CE):
- most sites are interested in adopting ARC CEs
- the future of CREAM is quite uncertain and several sites are already decommissioning their CREAM CEs; several people hope that WLCG will release official notes on which CEs will be supported and which ones are recommended for new sites
- the main issue with ARC is accounting: working for the most part, but several issues are still open (re-publication, Jura glitches, etc.)
- ARC developers are aware of that and looking for solutions (publishing via APEL client? Extracting data directly from LRMS's DB?)
- HTCondor CE discussed as well:
- several sites interested since it obviously integrates quite well with HTCondor
- open issues: its integration with BDII and APEL is far from done
- now packaged by OSG; all US T1 and T2 sites run it
- should WLCG re-package it?
- ARC support for the four main WLCG VOs seems quite viable and is already implemented at several sites
- ARC workshop this Autumn: one day should be dedicated to setting up and configuring an ARC CE
PSI
- Mainly preparing for the summer leave
- Reviewing and updating the full T3 documentation
- Improving the Puppet recipes
- Enhancing the Nagios checks
- Making the handover to Derek
- No further developments at this stage:
- dCache / PhEDEx
- NetApp E5400
- Son of Grid Engine 8.1.8
- No time to make further progress here, regrettably.
UNIBE-LHEP
- Operations
- Still at about half capacity
- Aircon issues on 21 May caused loss of power on more than half of the WNs. Power partly restored, but many nodes crashed right after power-up/re-install and are still down
- Lustre unstable on ce02 following power-up of the dead nodes; had to reformat it from scratch. Also needed to power-cycle the MDS ("sd 6:2:0:0: rejecting I/O to offline device")
- Issues since ARC 5:
- a-rex crashes still happening (caught by cron). Nordugrid developers are aware, but no progress on this so far.
- Bug in a-rex triggering massive verbosity in grid-manager.log, causing the /var partition to fill up. Hit us 6 times across both CEs. It does not occur when setting "debug=0" (no logging at all, not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies satisfactorily is tricky (no success so far)
- Bug (?) in a-rex causing the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: cron jobs to restart a-rex, then nordugrid-arc-ldap-infosys
- controldir stuffed with over 100k zero-byte files and as many directories under "joblinks". A thorough cleanup took a full night. This caused a-rex instabilities and eventually prevented it from starting at all. Tons of <defunct> processes, likely due to the restart crons
- obscure corruption in the control dir (prevents a-rex from starting). Cure: rm -rf /var/spool/nordugrid/jobstatus/gm.fifo
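The thorough controldir cleanup described above can be scripted rather than redone by hand; a minimal sketch, assuming a typical controldir layout (the 7-day age threshold and the idea of passing the controldir path explicitly are assumptions to adjust locally):

```shell
#!/bin/sh
# Sketch: prune a stuffed ARC controldir (zero-byte job files and empty
# "joblinks" directories). Pass the real controldir as the first argument,
# e.g. /var/spool/nordugrid/jobstatus as used on this site.

clean_controldir() {
    dir="$1"
    [ -d "$dir" ] || { echo "no such controldir: $dir" >&2; return 1; }
    # remove zero-byte leftovers older than 7 days at the top level
    find "$dir" -maxdepth 1 -type f -empty -mtime +7 -delete
    # remove empty per-job directories under joblinks (keep joblinks itself)
    [ -d "$dir/joblinks" ] && find "$dir/joblinks" -mindepth 1 -type d -empty -delete
    echo "cleanup done for $dir"
}

# only act when a controldir is given explicitly
if [ -n "${1:-}" ]; then
    clean_controldir "$1"
fi
```

Run nightly from cron to keep the controldir from accumulating 100k+ stale entries again; doing it incrementally avoids the full-night cleanup.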
- ATLAS specific operations
- gridengine refuses to run multicore jobs (not 100% of the time)
- cron to run qalter -R y on the mcore jobs in the queue
- allow 20 reservations:
# qconf -ssconf|grep reservation
max_reservation 20
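The qalter cron above can be sketched as follows; the "mcore" PE name and the qstat output parsing are assumptions about the local Grid Engine setup:

```shell
#!/bin/sh
# Sketch: turn on resource reservation (-R y) for pending multicore jobs,
# as per the cron workaround above. The PE name "mcore" is an assumption.

# Extract job ids from qstat output: skip the two header lines and keep
# only lines whose first field is a numeric job id.
pending_job_ids() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Only act when Grid Engine is actually installed.
if command -v qstat >/dev/null 2>&1; then
    qstat -s p -pe mcore | pending_job_ids | while read -r jid; do
        qalter -R y "$jid"    # request slot reservation for this job
    done
fi
```

With max_reservation set to 20 as above, the scheduler can then reserve slots for up to 20 such jobs instead of letting single-core back-fill starve them.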
- Ongoing work
- prototyping ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
- 320 old SunBlade cores delivered from Ubelix, awaiting rack-mounting and installation
- 6 IBM servers from CSCS to be picked up
- Temperature monitoring in the room is desirable; looking at self-made solutions
UNIBE-ID
- Operations
- smooth operation, except see below
- decommissioned and dumped nearly all of the old Sun hardware
- all Sun Blade chassis and blades moved to Gianfranco
- further progress in moving to a Puppet-managed environment, very promising
- ATLAS specific operations
- problems with grid-mapfile for atlas:
- fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
- this stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
- consequence: no jobs have been allowed since last Saturday, after all DNs were finally dropped
- Investigations on the operator side are currently in progress
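Until the VOMS endpoint issue is resolved, a sanity check on the operator side can stop an empty grid-mapfile from being installed and dropping all DNs again; a minimal sketch (the file paths and the minimum-entry threshold are assumptions):

```shell
#!/bin/sh
# Sketch: only install a freshly generated grid-mapfile if it still holds
# a sane number of DN entries (lines starting with a quoted DN). The
# threshold of 10 is an assumption; tune it to the expected VO size.

install_if_sane() {
    new="$1"; current="$2"; min_entries="${3:-10}"
    n=$(grep -c '^"' "$new" 2>/dev/null || true)
    n=${n:-0}
    if [ "$n" -lt "$min_entries" ]; then
        echo "refusing to install $new: only $n DN entries (min $min_entries)" >&2
        return 1
    fi
    cp "$new" "$current"
    echo "installed $current with $n DN entries"
}
```

Usage: generate the new mapfile into a temporary location first, then call install_if_sane with the temporary file and the live grid-mapfile path; a failed VOMS fetch then leaves the last good mapfile in place instead of locking everyone out.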
UNIGE
NGI_CH
- Update on certificates:
- 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
- User cert OK; will verify the host cert in the next few days
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
- CMS: Daniel Meister, Fabio Martinelli
- ATLAS: Gianfranco Sciacca
- LHCb: Roland Bernet
- EGI: Gianfranco Sciacca
Action items
Topic revision: r15 - 2015-06-10 - RolandBernet