Swiss Grid Operations Meeting on 2015-06-10
Site status
CSCS
- Scheduled Downtime 20.05.2015
- DCS3700 storage controllers firmware updated
- Scheduled Downtime 08.06.2015
- ARC Configuration
- arc01: eventually working (ATLAS jobs) and passing (almost) all checks on NGI Nagios; some minor tuning may still be necessary and the configuration via Puppet still needs to be completed
- Other Operations
- Attending Pre-GDB meeting on 12.05.2015 at CERN (GR)
- quite interesting, and dedicated to the ongoing efforts of Grid sites migrating away from LRMSes that are no longer supported
- most sites are interested in HTCondor, only a few in SLURM or SGE (Univa or SoGE)
- several issues related to HTCondor discussed:
- multicore jobs: the configuration is not trivial regarding prioritization, fair-share and back-filling
- BDII issues related to HTCondor (its out-of-the-box BDII support is quite old); the future of BDII within WLCG was also briefly discussed
- cgroups issues on RHEL6 reported by people at DESY, KIT and RAL
- ARC accounting issues (Jura) reported and discussed: interesting slides by John Gordon
- alternatives to CREAM CEs discussed as well (ARC, Condor CE):
- most sites are interested in adopting ARC CEs
- the future of CREAM is quite uncertain and several sites are already decommissioning their CREAM CEs; several people hope that WLCG will release official notes on which CEs will be supported and which ones are recommended for new sites
- the main issue with ARC is accounting: working for the most part, but several issues are still open (re-publication, Jura glitches, etc.)
- ARC developers are aware of that and looking for solutions (publishing via APEL client? Extracting data directly from LRMS's DB?)
- HTCondor CE discussed as well:
- several sites interested since it obviously integrates quite well with HTCondor
- open issues: its integration with BDII and APEL is far from done
- now packaged by OSG; all US T1 and T2 sites run it
- should WLCG re-package it?
- ARC support for the four main WLCG VOs seems quite viable and is already implemented at several sites
- ARC workshop this Autumn: one day should be dedicated to setting up and configuring an ARC CE
PSI
- Mainly preparing for the summer leave
- Reviewing and updating the full T3 documentation
- Improving the Puppet recipes
- Enhancing the Nagios checks
- Making the handover to Derek
- No further developments at this stage:
- dCache / PhEDEx
- NetApp E5400
- Son of Grid Engine 8.1.8
- No time to make further progress here, regrettably.
UNIBE-LHEP
- Operations
- Still at about half capacity
- Aircon issues on 21 May caused loss of power on more than half of the WNs. Power partly restored, but many nodes crashed right after power-up/re-install and are still down
- Lustre unstable on ce02 following power-up of the dead nodes; had to reformat it from scratch. Also needed to power-cycle the MDS ("sd 6:2:0:0: rejecting I/O to offline device")
- Issues since ARC 5:
- a-rex crashes still happening (caught by cron). Nordugrid developers are aware, but no progress on this so far.
- Bug in a-rex triggering massive verbosity in grid-manager.log, causing the /var partition to fill up. Hit us 6 times across both CEs. It does not occur when setting "debug=0" (no logging at all, not ideal). There is a patched a-rex RPM in the nightly builds, but resolving the dependencies satisfactorily is tricky (no success so far)
- Bug (?) in a-rex causing the a-rex infoprovider to stop updating the BDII (the cluster drops out of the GIIS). Manual workaround: cron jobs to restart a-rex, then nordugrid-arc-ldap-infosys
- controldir stuffed with over 100k zero-byte files and as many directories under "joblinks". A thorough cleanup took a full night. This caused a-rex instabilities and eventually prevented it from starting at all. Tons of <defunct> processes, likely due to the restart crons
- obscure corruption in the control dir (prevents a-rex from starting). Cure: rm -rf /var/spool/nordugrid/jobstatus/gm.fifo
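The thorough controldir cleanup described above can be scripted rather than redone by hand; a minimal sketch, assuming a typical controldir layout (the 7-day age threshold and the idea of passing the controldir path explicitly are assumptions to adjust locally):

```shell
#!/bin/sh
# Sketch: prune a stuffed ARC controldir (zero-byte job files and empty
# "joblinks" directories). Pass the real controldir as the first argument,
# e.g. /var/spool/nordugrid/jobstatus as used on this site.

clean_controldir() {
    dir="$1"
    [ -d "$dir" ] || { echo "no such controldir: $dir" >&2; return 1; }
    # remove zero-byte leftovers older than 7 days at the top level
    find "$dir" -maxdepth 1 -type f -empty -mtime +7 -delete
    # remove empty per-job directories under joblinks (keep joblinks itself)
    [ -d "$dir/joblinks" ] && find "$dir/joblinks" -mindepth 1 -type d -empty -delete
    echo "cleanup done for $dir"
}

# only act when a controldir is given explicitly
if [ -n "${1:-}" ]; then
    clean_controldir "$1"
fi
```

Run nightly from cron to keep the controldir from accumulating 100k+ stale entries again; doing it incrementally avoids the full-night cleanup.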
- ATLAS specific operations
- gridengine refuses to run multicore jobs (not 100% of the time)
- cron to run qalter -R y on the mcore jobs in the queue
- allow 20 reservations:
# qconf -ssconf|grep reservation
max_reservation 20
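The qalter cron above can be sketched as follows; the "mcore" PE name and the qstat output parsing are assumptions about the local Grid Engine setup:

```shell
#!/bin/sh
# Sketch: turn on resource reservation (-R y) for pending multicore jobs,
# as per the cron workaround above. The PE name "mcore" is an assumption.

# Extract job ids from qstat output: skip the two header lines and keep
# only lines whose first field is a numeric job id.
pending_job_ids() {
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Only act when Grid Engine is actually installed.
if command -v qstat >/dev/null 2>&1; then
    qstat -s p -pe mcore | pending_job_ids | while read -r jid; do
        qalter -R y "$jid"    # request slot reservation for this job
    done
fi
```

With max_reservation set to 20 as above, the scheduler can then reserve slots for up to 20 such jobs instead of letting single-core back-fill starve them.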
- Ongoing work
- prototyping ROCKS 6.1.1 deployment for re-installation (ROCKS 6.1 does not support newer hardware)
- 320 old SunBlade cores delivered from Ubelix, awaiting rack-mounting and installation
- 6 IBM servers from CSCS to be picked up
- Temperature monitoring in the room is desirable; looking at self-made solutions
UNIBE-ID
- Operations
- smooth operation, except see below
- decommissioned and dumped nearly all of the old Sun hardware
- all Sun Blade chassis and blades moved to Gianfranco
- further progress in moving to a Puppet-managed environment, very promising
- ATLAS specific operations
- problems with grid-mapfile for atlas:
- fetching this works: vomss://voms2.cern.ch:8443/voms/atlas
- this stopped working last week: vomss://voms2.cern.ch:8443/voms/atlas?/atlas/Role=production
- consequence: no jobs have been allowed since last Saturday, after all DNs were finally dropped
- Investigations on the operator side are currently in progress
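Until the VOMS endpoint issue is resolved, a sanity check on the operator side can stop an empty grid-mapfile from being installed and dropping all DNs again; a minimal sketch (the file paths and the minimum-entry threshold are assumptions):

```shell
#!/bin/sh
# Sketch: only install a freshly generated grid-mapfile if it still holds
# a sane number of DN entries (lines starting with a quoted DN). The
# threshold of 10 is an assumption; tune it to the expected VO size.

install_if_sane() {
    new="$1"; current="$2"; min_entries="${3:-10}"
    n=$(grep -c '^"' "$new" 2>/dev/null || true)
    n=${n:-0}
    if [ "$n" -lt "$min_entries" ]; then
        echo "refusing to install $new: only $n DN entries (min $min_entries)" >&2
        return 1
    fi
    cp "$new" "$current"
    echo "installed $current with $n DN entries"
}
```

Usage: generate the new mapfile into a temporary location first, then call install_if_sane with the temporary file and the live grid-mapfile path; a failed VOMS fetch then leaves the last good mapfile in place instead of locking everyone out.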
UNIGE
NGI_CH
- Update on certificates:
- 1 user certificate and 1 host certificate (voms.lhep.unibe.ch) requested and issued
- User cert OK; will verify the host cert in the next few days
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS: Dino Conciatore, Gianni Ricciardi, Dario Petrusic. Apologies: Miguel Gila, Nick Cardo.
- CMS: Daniel Meister, Fabio Martinelli
- ATLAS: Gianfranco Sciacca
- LHCb: Roland Bernet
- EGI: Gianfranco Sciacca
Action items
Topic revision: r15 - 2015-06-10 - RolandBernet