
Swiss Grid Operations Meeting on 2014-05-13

Agenda

Status

  • CSCS (reports Miguel):
    • Operations:
      • Upgraded to the latest ARC CE release (4.1); surprisingly painless
      • Summary of operations to be carried out during our next maintenance (19.05.2014):
        1. Add an index on start time to the job table in the Slurm DB, possibly a multi-column index on (start time, end time) (a sketch follows this list)
        2. Install iPXE oproms on IB HCAs
        3. Upgrade dCache to the latest 2.6.x release and restart everywhere for the xrootd plugin to take effect
        4. Upgrade to CVMFS 2.1.17 to enable logging when the CVMFS inode limit is reached.
        5. Update slurm.conf with less precise memory values for compute nodes + PAM
        6. Change runtimedir="/experiment_software/atlas/nordugrid/runtime" in arc.conf. This includes moving the data to a directory on the NAS
        7. Remove old unmanaged IB switch + old Sun X4140 named xen01 from rack #4
        8. Mount replaced IB switch on rails
        9. Prepare configuration for polyinstantiated /tmp on all WNs (http://wiki.chipp.ch/twiki/bin/view/LCGTier2/SiteSpecificModifications#Polyinstantiated_tmp_on_WNs). Enable it on only a few to test (~5)
        10. cvmfs1 does not have its cache dir on a separate volume group like cvmfs does; this should be corrected
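        A minimal sketch of item 1, assuming the Slurm accounting DB is MySQL; the table name (slurmdbd uses <clustername>_job_table) and the time_start/time_end column names should be verified against the local schema before running anything like this:

            # Hypothetical helper, not an existing CSCS script.
            import mysql.connector  # MySQL Connector/Python

            # Connect to the slurmdbd accounting database (credentials are placeholders).
            conn = mysql.connector.connect(host="localhost", user="slurm",
                                           password="CHANGEME", database="slurm_acct_db")
            cur = conn.cursor()
            # Multi-column index on job start/end time, as discussed in item 1.
            cur.execute("ALTER TABLE cluster_job_table "
                        "ADD INDEX idx_time_start_end (time_start, time_end)")
            conn.commit()
            cur.close()
            conn.close()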
    • Issues:
      • New WNs seem to be hitting the ATLAS CVMFS issue presented at http://indico.cern.ch/event/289284/material/slides/0?contribId=0 ; plan to upgrade to the latest CVMFS release, which will log an event for this, and to incorporate it into blackhole detection (see the sketch after this list).
      • dCache issue on Monday: numerous strange errors; the root cause was low free space (pools were not 100% full). Monitoring has been tweaked to alert on this earlier; looking into a wider range of checks.
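      A minimal sketch of what the planned blackhole-detection hook could look like, assuming the new CVMFS release logs the inode-limit event to syslog; the log pattern and the scontrol drain step are assumptions, not the actual CSCS implementation:

          # Hypothetical worker-node check, run periodically (e.g. from cron).
          import re
          import socket
          import subprocess

          # Assumed pattern for the inode-limit message the new CVMFS release logs.
          INODE_MSG = re.compile(r"cvmfs.*inode.*(limit|exhaust)", re.IGNORECASE)

          def hit_inode_limit(logfile="/var/log/messages"):
              """Return True if the node's syslog contains the inode-limit event."""
              with open(logfile, errors="ignore") as fh:
                  return any(INODE_MSG.search(line) for line in fh)

          if hit_inode_limit():
              node = socket.gethostname()
              # Drain the node so no new jobs land on it.
              subprocess.check_call(["scontrol", "update", "NodeName=" + node,
                                     "State=DRAIN", "Reason=cvmfs_inode_limit"])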
  • PSI (reports Fabio):
  • UNIBE-LHEP (reports Gianfranco):
    • Production operations
      • Stable again with no downtime last month
      • However: EGI ops availability/reliability 33% due to probe org.sam.SRM-GetSURLs-/ops/NGI/Germany failing (causing all the other SRM probes to go into Unknown state). The problem self-resolved on 2nd May (no changes on the SE). Waiting for feedback, we might want to request availability/reliability re-calculation (tracked at https://ggus.eu/index.php?mode=ticket_info&ticket_id=104765 )
      • GLUE2 Validator Warnings were ticketed (Ref: https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=314). Turned out to be a minor ARC infosys bug, corrected by patching SGEmod.pm
      • Critical DPM vulnerability in dmlite-libs (from EPEL only) announced yesterday: required upgrade from 0.6.2-1 to 0.6.2-2 and restart of services (performed yesterday).
    • ATLAS specific operations
      • ARC upgraded from 4.0.0-1.el6 to 4.1.0-1.el6 on ce01, ce02 and ce03. Needed because ATLAS ops moved to Rucio as the DDM tool (as of yesterday), making the previous ARC version obsolete.
      • ATLASSCRATCHDISK token deployed. With the new ATLAS ARC Control Tower, jobs will be able to use the local SE at the site and FAX, as opposed to working exclusively with the remote T1 SE at NDGF.
      • FAX setup partially broken, due to a host certificate problem on one of the disk servers since the Heartbleed certificate replacement campaign. A new certificate was requested yesterday but has not yet been released.
    • Accounting
      • Established that an unquantified number of records was lost last summer, as a consequence of SGAS misbehaviour and a corrupted backup of our CEs' records after their re-installation
      • Backlogs of usage records on ce01 and ce02 had to be sent to SGAS manually. Injection into SGAS was failing because some records were too large, causing the ur-logger registration to fail. Identified and moved away all records >50 kB (some were tens of MB!) and pushed all records by hand to SGAS (over 200k records); a cleanup sketch follows this list
      • Switched off ur-logger and enabled Jura to send records to SGAS. Some hiccups at the beginning: published records were not being archived as requested in the jobreport_options; then it started working all of a sudden (on ce01, ce02)
      • Added a jobreport_option to send the same records to APEL test servers as well. This worked straight away (on ce01, ce02)
      • Attempted to validate the published numbers by cross-checking the figures at http://goc-accounting.grid-support.ac.uk/apeltest2/jobs.html with those painfully extracted from the archived records. No firm conclusion (numbers on that portal are updated with an unpredictable delay), but the figures on the portal are always greater than those computed from the archived records.
      • Last Friday we decided to open a GGUS ticket to request the switch to the APEL production servers, but then realised that since the ARC upgrade to 4.1.0 Jura has stopped working (!): all Jura-created records stay in <sessiondir>/jobstatus/logs. When triggered by hand, publishing and archiving work, but only for some of the latest records. Ongoing effort...
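      A minimal sketch of the >50 kB cleanup mentioned above, assuming the usage records sit as plain files in a ur-logger spool directory; both paths are placeholders, not the actual ce01/ce02 locations:

          # Hypothetical cleanup helper, not an existing ARC/SGAS tool.
          import os
          import shutil

          SPOOL = "/var/spool/nordugrid/usagerecords"    # assumed spool location
          QUARANTINE = "/var/spool/nordugrid/oversized"  # assumed quarantine location
          LIMIT = 50 * 1024                              # 50 kB threshold from the minutes

          os.makedirs(QUARANTINE, exist_ok=True)
          for name in os.listdir(SPOOL):
              path = os.path.join(SPOOL, name)
              if os.path.isfile(path) and os.path.getsize(path) > LIMIT:
                  # Move the oversized record aside so registration to SGAS can proceed.
                  shutil.move(path, os.path.join(QUARANTINE, name))
                  print("quarantined", name)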
    • Procurement
      • Took delivery of six 36-disk servers (for the new Lustre deployment on ce01, replacing the Thumpers). Setting them up is delayed due to lack of time so far.
  • UNIBE-ID (reports Michael):
    • Network performance
      • did one day of performance testing during the downtime last week; set up new buffer profiles and tested them
      • moved 10G servers back into the switch fabric
      • performance under heavy load now better than previously
    • ARC CE accounting issue
      • EOL of old nordugrid.unibe.ch
      • switched the CE to a new machine -> two weeks later realised that the reported walltime/cputime was zero
      • /usr/bin/qacct was missing because gridengine-qmaster was not installed (all the other binaries such as qsub and qstat are in the gridengine package) -> no error messages in any CE logs
      • issue fixed by installing gridengine-qmaster, but SGAS now has no data for the range 2014-04-10 to 2014-05-05
      • ggus ticket opened: sanity check for qacct in scan-sge-job script missing -> now in nordugrid's bugzilla http://bugzilla.nordugrid.org/show_bug.cgi?id=3369
      • job ids are available in the archived records; accounting data is available in the SGE accounting file, but...
      • ... how to fix this with reasonable effort? (one possible approach is sketched below)
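      One possible recovery path, sketched under the assumption that the archived ARC records carry the local SGE job ids: rebuild a job-id -> (wallclock, cpu) map from the SGE accounting file and merge it back into the records. Field positions follow the accounting(5) layout and should be checked against the local Grid Engine version:

          # Hypothetical helper, not an existing ARC or Grid Engine tool.
          def load_sge_usage(path="/var/lib/gridengine/default/common/accounting"):
              """Map SGE job number -> (ru_wallclock, cpu) from the accounting file."""
              usage = {}
              with open(path) as fh:
                  for line in fh:
                      if line.startswith("#"):
                          continue
                      f = line.rstrip("\n").split(":")
                      job_number = f[5]            # assumed field position
                      ru_wallclock = float(f[13])  # assumed field position
                      cpu = float(f[36])           # assumed field position
                      usage[job_number] = (ru_wallclock, cpu)
              return usage

          if __name__ == "__main__":
              table = load_sge_usage()
              print(len(table), "jobs with recoverable usage")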
    • ARC CE accounting
      • switched from ur-logger->SGAS to JURA->SGAS + JURA->APEL-test
      • switch to JURA->APEL-prod on hold (see previous)
      • since the ARC upgrade to 4.1.0, Jura has stopped working for us too, as Gianfranco reported: all Jura-created records stay in <sessiondir>/jobstatus/logs.
    • ARC CE
      • upgraded to 4.1
    • Infiniband
      • started planning for a proper fat tree in our IB network
      • goal: move gpfs daemon traffic from ETH to IB
    • Courses
      • conducted an introductory course (1.5 h) on using our cluster for researchers at the Department of Clinical Research
      • a fruitful event
    • Heartbleed (amendment to the last agenda)
      • immediate update of all relevant servers (RHEL 6)
      • exchanged two certificates
  • UNIGE (reports Szymon):
    • Finished adding the ex-Trigger hardware
      • 35 taken, 31 added, 3 dead, 1 remaining (physically there but not yet added)
      • Cores removed: 8 login, 24 batch
      • Cores added: 16 login, 232 batch
      • total batch slots now: 656
    • Migration from SLC5 to SLC6
      • The last 10% of batch cores and the last three login machines still run SLC5; some users need it
      • For batch, we can probably solve it soon. For login, probably not (compilation is needed there).
    • ARC 4.1 installed on a new VM, config needs to be done
      • (thanks to CSCS for their arc.conf)
    • The 2014 upgrade is funded. The plan is:
      • two hosts to run critical services in virtual machines (IBM 3550 M4)
      • two disk servers for new 'user' and 'software' space (Solaris, ZFS, NFS) (Sun Server X4-2)
      • one disk server for bulk data storage (Linux, NFS) (IBM 3630 M4)
    • Maintenance issues:
      • one crash of a Solaris disk server when it filled up
      • one crash of the VM running the site BDII (I/O errors); a reset was necessary
      • one cooling problem of a hardware RAID on an IBM disk server

  • OMB (reports Gianfranco):
    • No relevant news from the latest OMB: https://wiki.egi.eu/wiki/Agenda-05-05-2014
    • latest MW versions in URT: ARC - 13.11u1 version 4.1.0; dCache - UMD-3 dcache-server 2.6.23; BDII core - new glue-validator; DPM/LFC - v. 1.8.8; GFAL/lcg_utils - v. 2.5.5
    • April EGI ops report (attached): 33% avail/reliab for UNIBE-LHEP. Might require re-computation.
    • Status of the migration of national tasks from SWITCH (no funding is available to cover these):
      • VOMS, GIIS -> migrated to BERN
      • SGAS -> being retired, aim for end of May
      • ARGUS -> see discussion below
      • ROD shifts -> will continue with CSCS
      • MONITORING -> Open, to be discussed at CHIPP CB (check server itself, site support, advanced support/logs, system config/update)
      • OMB -> meetings, monthly reports covered by BERN; accounting check, site config/performance/GGUS tks, security: each site covers their own (mutual support via [operations] list)
      • MAILING LISTS -> covered by BERN; BERN will receive the lists, GS should be admin of all
      • GOCDB -> Each site to look after their entries. Extra admin role in BERN (to be verified, should be able to edit all sites).
      • DTEAM, OPS -> Max two members for ops: BERN + ??
      • EUGridPMA -> Open (read minutes of meetings, Ales to send pointer to info)
Other topics
  • ROD shifts: currently CSCS is the only partner in CH doing this task. Since there are now more active sites in the NGI_CH cloud, CSCS expects this workload to be distributed.
  • ARGUS server: currently the CREAM-CE is the only service using the local ARGUS + CERN central banning. Could dCache be configured as well?
  • ARGUS deployment status: https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=284
    • National ARGUS instance requested (deadline: end 2013). This would pull info from the CERN servers. Site instances would pull info from the National ARGUS.
    • But:
      • dCache: no ARGUS support
      • ARC: global banning can be performed at the ARC Control Tower level (pioneered by Wuppertal)
      • DPM: as of 1.8.8 (released yesterday), support for ARGUS is back.
      • Quoting Maarten Litmaath: "AFAIK global banning support still is not a hard requirement for anything, but that may change later this year: both EGI and WLCG would like to see it work at least for CEs."
      • Future support from SWITCH (Valery) purely on a best effort/emergency basis. If no-one else picks this up, it might as well be retired at some point (my speculation)
    • Our proposal: leave everything as it is now, perhaps designate the CSCS service as the national service, and argue with EGI (if needed) that ARC does not need ARGUS. Do not use it for SEs.
Next meeting date:

AOB

Attendants

  • CSCS: George Brown, Miguel Gila
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1
Topic attachments
  • EGI_Apr2014.pdf (159.8 K, 2014-05-14, GianfrancoSciacca): EGI availability report April 2014
  • SaltStack_Talk_at_PSI_-_F.Martinelli_-_7_May_20141.pdf (4704.6 K, 2014-05-12, FabioMartinelli): SaltStack Talk at PSI - F.Martinelli - 7 May 2014