
Swiss Grid Operations Meeting on 2014-05-13

Agenda

Status

  • CSCS (reports Miguel):
    • Operations:
      • Upgraded to the latest ARC CE release (4.1); surprisingly painless
      • Summary of operations to be carried out during our next maintenance (19.05.2014):
        1. Add an index on start time to the job table in the Slurm DB, possibly a multi-column index on (start time, end time) (a sketch follows this list)
        2. Install iPXE oproms on IB HCAs
        3. Upgrade dCache to the latest 2.6.x release and restart everywhere for the xrootd plugin to take effect
        4. Upgrade to CVMFS 2.1.17 to enable logging when the CVMFS inode limit is reached.
        5. Update slurm.conf with less precise memory values for compute nodes + PAM
        6. Change runtimedir="/experiment_software/atlas/nordugrid/runtime" in arc.conf. This includes moving the data to a directory on the NAS
        7. Remove old unmanaged IB switch + old Sun X4140 named xen01 from rack #4
        8. Mount replaced IB switch on rails
        9. Prepare configuration for polyinstantiated /tmp on all WNs (http://wiki.chipp.ch/twiki/bin/view/LCGTier2/SiteSpecificModifications#Polyinstantiated_tmp_on_WNs). Enable it on only a few to test (~5)
        10. cvmfs1 does not have its cache dir on a separate volume group like cvmfs does; this should be corrected
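        A minimal sketch of item 1, assuming the Slurm accounting DB is MySQL; the table name (slurmdbd uses <clustername>_job_table) and the time_start/time_end column names should be verified against the local schema before running anything like this:

            # Hypothetical helper, not an existing CSCS script.
            import mysql.connector  # MySQL Connector/Python

            # Connect to the slurmdbd accounting database (credentials are placeholders).
            conn = mysql.connector.connect(host="localhost", user="slurm",
                                           password="CHANGEME", database="slurm_acct_db")
            cur = conn.cursor()
            # Multi-column index on job start/end time, as discussed in item 1.
            cur.execute("ALTER TABLE cluster_job_table "
                        "ADD INDEX idx_time_start_end (time_start, time_end)")
            conn.commit()
            cur.close()
            conn.close()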
    • Issues:
      • New WNs seem to be hitting the ATLAS CVMFS issue presented at http://indico.cern.ch/event/289284/material/slides/0?contribId=0 ; plan to upgrade to the latest CVMFS release, which will log an event for this, and to incorporate it into blackhole detection (see the sketch after this list).
      • dCache issue on Monday: numerous strange errors; the root cause was low free space (pools were not 100% full). Monitoring has been tweaked to alert on this earlier; looking into a wider range of checks.
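      A minimal sketch of what the planned blackhole-detection hook could look like, assuming the new CVMFS release logs the inode-limit event to syslog; the log pattern and the scontrol drain step are assumptions, not the actual CSCS implementation:

          # Hypothetical worker-node check, run periodically (e.g. from cron).
          import re
          import socket
          import subprocess

          # Assumed pattern for the inode-limit message the new CVMFS release logs.
          INODE_MSG = re.compile(r"cvmfs.*inode.*(limit|exhaust)", re.IGNORECASE)

          def hit_inode_limit(logfile="/var/log/messages"):
              """Return True if the node's syslog contains the inode-limit event."""
              with open(logfile, errors="ignore") as fh:
                  return any(INODE_MSG.search(line) for line in fh)

          if hit_inode_limit():
              node = socket.gethostname()
              # Drain the node so no new jobs land on it.
              subprocess.check_call(["scontrol", "update", "NodeName=" + node,
                                     "State=DRAIN", "Reason=cvmfs_inode_limit"])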
  • PSI (reports Fabio):
  • UNIBE-LHEP (reports Gianfranco):
    • Production operations
      • Stable again with no downtime last month
      • However: EGI ops availability/reliability 33% due to probe org.sam.SRM-GetSURLs-/ops/NGI/Germany failing (causing all the other SRM probes to go into Unknown state). The problem self-resolved on 2nd May (no changes on the SE). Waiting for feedback, we might want to request availability/reliability re-calculation (tracked at https://ggus.eu/index.php?mode=ticket_info&ticket_id=104765 )
      • GLUE2 Validator Warnings were ticketed (Ref: https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=314). Turned out to be a minor ARC infosys bug, corrected by patching SGEmod.pm
      • Critical DPM vulnerability in dmlite-libs (from EPEL only) announced yesterday: required upgrade from 0.6.2-1 to 0.6.2-2 and restart of services (performed yesterday).
    • ATLAS specific operations
      • ARC upgraded from 4.0.0-1.el6 to 4.1.0-1.el6 on ce01, ce02 and ce03. Needed because ATLAS ops moved to Rucio as the DDM tool (as of yesterday), making the previous ARC version obsolete.
      • ATLASSCRATCHDISK token deployed. With the new ATLAS ARC Control Tower, jobs will be able to use the local SE at the site and FAX, as opposed to working exclusively with the remote T1 SE at NDGF.
      • FAX setup partially broken, due to a host certificate problem on one of the disk servers since the Heartbleed certificate replacement campaign. A new certificate was requested yesterday but has not yet been released.
    • Accounting
      • Established that an unquantified number of records was lost last summer, as a consequence of SGAS misbehaviour and a corrupted backup of our CEs' records after their re-installation
      • Backlogs of usage records on ce01 and ce02 had to be sent to SGAS manually. Injection into SGAS was failing because some records were too large, causing the ur-logger registration to fail. Identified and moved away all records >50 kB (some were tens of MB!) and pushed all records by hand to SGAS (over 200k records); a cleanup sketch follows this list
      • Switched off ur-logger and enabled Jura to send records to SGAS. Some hiccups at the beginning: published records were not being archived as requested in the jobreport_options; then it started working all of a sudden (on ce01, ce02)
      • Added a jobreport_option to send the same records to APEL test servers as well. This worked straight away (on ce01, ce02)
      • Attempted to validate the published numbers by cross-checking the figures at http://goc-accounting.grid-support.ac.uk/apeltest2/jobs.html with those painfully extracted from the archived records. No firm conclusion (numbers on that portal are updated with an unpredictable delay), but the figures on the portal are always greater than those computed from the archived records.
      • Last Friday we decided to open a GGUS ticket to request the switch to the APEL production servers, but then realised that since the ARC upgrade to 4.1.0 Jura has stopped working (!): all Jura-created records stay in <sessiondir>/jobstatus/logs. When triggered by hand, publishing and archiving work, but only for some of the latest records. Ongoing effort...
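      A minimal sketch of the >50 kB cleanup mentioned above, assuming the usage records sit as plain files in a ur-logger spool directory; both paths are placeholders, not the actual ce01/ce02 locations:

          # Hypothetical cleanup helper, not an existing ARC/SGAS tool.
          import os
          import shutil

          SPOOL = "/var/spool/nordugrid/usagerecords"    # assumed spool location
          QUARANTINE = "/var/spool/nordugrid/oversized"  # assumed quarantine location
          LIMIT = 50 * 1024                              # 50 kB threshold from the minutes

          os.makedirs(QUARANTINE, exist_ok=True)
          for name in os.listdir(SPOOL):
              path = os.path.join(SPOOL, name)
              if os.path.isfile(path) and os.path.getsize(path) > LIMIT:
                  # Move the oversized record aside so registration to SGAS can proceed.
                  shutil.move(path, os.path.join(QUARANTINE, name))
                  print("quarantined", name)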
    • Procurement
      • Took delivery of six 36-disk servers (for the new Lustre deployment on ce01, replacing the Thumpers). Setting them up is delayed due to lack of time so far.
  • UNIBE-ID (reports Michael):
    • Network performance
      • did one day of performance testing during the downtime last week; set up new buffer profiles and tested them
      • moved 10G servers back into the switch fabric
      • performance under heavy load now better than previously
    • ARC CE accounting issue
      • EOL of old nordugrid.unibe.ch
      • switched the CE to a new machine -> two weeks later realised that the reported walltime/cputime was zero
      • /usr/bin/qacct was missing because gridengine-qmaster was not installed (all the other binaries such as qsub and qstat are in the gridengine package) -> no error messages in any CE logs
      • issue fixed by installing gridengine-qmaster, but SGAS now has no data for the range 2014-04-10 to 2014-05-05
      • ggus ticket opened: sanity check for qacct in scan-sge-job script missing -> now in nordugrid's bugzilla http://bugzilla.nordugrid.org/show_bug.cgi?id=3369
      • job ids are available in the archived records; accounting data is available in the SGE accounting file, but...
      • ... how to fix this with reasonable effort? (one possible approach is sketched below)
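      One possible recovery path, sketched under the assumption that the archived ARC records carry the local SGE job ids: rebuild a job-id -> (wallclock, cpu) map from the SGE accounting file and merge it back into the records. Field positions follow the accounting(5) layout and should be checked against the local Grid Engine version:

          # Hypothetical helper, not an existing ARC or Grid Engine tool.
          def load_sge_usage(path="/var/lib/gridengine/default/common/accounting"):
              """Map SGE job number -> (ru_wallclock, cpu) from the accounting file."""
              usage = {}
              with open(path) as fh:
                  for line in fh:
                      if line.startswith("#"):
                          continue
                      f = line.rstrip("\n").split(":")
                      job_number = f[5]            # assumed field position
                      ru_wallclock = float(f[13])  # assumed field position
                      cpu = float(f[36])           # assumed field position
                      usage[job_number] = (ru_wallclock, cpu)
              return usage

          if __name__ == "__main__":
              table = load_sge_usage()
              print(len(table), "jobs with recoverable usage")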
    • ARC CE accounting
      • switched from ur-logger->SGAS to JURA->SGAS + JURA->APEL-test
      • switch to JURA->APEL-prod on hold (see previous)
      • since the ARC upgrade to 4.1.0, Jura has stopped working for us too, as Gianfranco reported: all Jura-created records stay in <sessiondir>/jobstatus/logs.
    • ARC CE
      • upgraded to 4.1
    • Infiniband
      • started planning for a proper fat tree in our IB network
      • goal: move gpfs daemon traffic from ETH to IB
    • Courses
      • conducted an introductory course (1.5 h) on using our cluster for researchers at the Department of Clinical Research
      • a fruitful event
    • Heartbleed (amendment to the last agenda)
      • immediate update of all relevant servers (RHEL 6)
      • exchanged two certificates
  • UNIGE (reports Szymon):
    • Finished adding the ex-Trigger hardware
      • 35 taken, 31 added, 3 dead, 1 remaining (physically there but not yet added)
      • Cores removed: 8 login, 24 batch
      • Cores added: 16 login, 232 batch
      • total batch slots now: 656
    • Migration from SLC5 to SLC6
      • The last 10% of batch cores and the last three login machines still run SLC5; some users need it
      • For batch, we can probably solve it soon. For login, probably not (compilation is needed there).
    • ARC 4.1 installed on a new VM, config needs to be done
      • (thanks to CSCS for their arc.conf)
    • The 2014 upgrade is funded. The plan is:
      • two hosts to run critical services in virtual machines (IBM 3550 M4)
      • two disk servers for new 'user' and 'software' space (Solaris, ZFS, NFS) (Sun Server X4-2)
      • one disk server for bulk data storage (Linux, NFS) (IBM 3630 M4)
    • Maintenance issues:
      • one crash of a Solaris disk server when it filled up
      • one crash of the VM running the site BDII (I/O errors); a reset was necessary
      • one cooling problem of a hardware RAID on an IBM disk server

  • OMB (reports Gianfranco):
    • No relevant news from the latest OMB: https://wiki.egi.eu/wiki/Agenda-05-05-2014
    • latest MW versions in URT: ARC - 13.11u1 version 4.1.0; dCache - UMD-3 dcache-server 2.6.23; BDII core - new glue-validator; DPM/LFC - v. 1.8.8; GFAL/lcg_utils - v. 2.5.5
    • April EGI ops report (attached): 33% avail/reliab for UNIBE-LHEP. Might require re-computation.
    • Status of the migration of national tasks from SWITCH (no funding is available to cover these):
      • VOMS, GIIS -> migrated to BERN
      • SGAS -> being retired, aim for end of May
      • ARGUS -> see discussion below
      • ROD shifts -> will continue with CSCS
      • MONITORING -> Open, to be discussed at CHIPP CB (check server itself, site support, advanced support/logs, system config/update)
      • OMB -> meetings, monthly reports covered by BERN; accounting check, site config/performance/GGUS tks, security: each site covers their own (mutual support via [operations] list)
      • MAILING LISTS -> covered by BERN; BERN will receive the lists, GS should be admin of all
      • GOCDB -> Each site to look after their entries. Extra admin role in BERN (to be verified, should be able to edit all sites).
      • DTEAM, OPS -> Max two members for ops: BERN + ??
      • EUGridPMA -> Open (read minutes of meetings, Ales to send pointer to info)
Other topics
  • ROD shifts: currently CSCS is the only partner in CH doing this task. Since there are now more active sites in the NGI_CH cloud, CSCS expects this workload to be distributed.
  • ARGUS server: currently the CREAM-CE is the only service using the local ARGUS + CERN central banning. Could dCache be configured as well?
  • ARGUS deployment status: https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=284
    • National ARGUS instance requested (deadline: end 2013). This would pull info from the CERN servers. Site instances would pull info from the National ARGUS.
    • But:
      • dCache: no ARGUS support
      • ARC: global banning can be performed at the ARC Control Tower level (pioneered by Wuppertal)
      • DPM: as of 1.8.8 (released yesterday), support for ARGUS is back.
      • Quoting Maarten Litmaath: "AFAIK global banning support still is not a hard requirement for anything, but that may change later this year: both EGI and WLCG would like to see it work at least for CEs."
      • Future support from SWITCH (Valery) purely on a best effort/emergency basis. If no-one else picks this up, it might as well be retired at some point (my speculation)
    • Our proposal: leave everything as it is now, perhaps designate the CSCS service as the national service, and argue with EGI (if needed) that ARC does not need ARGUS. Do not use it for SEs.
Next meeting date:

AOB

Attendants

  • CSCS: George Brown, Miguel Gila
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco Sciacca, Szymon Gadomski
  • LHCb: Roland Bernet
  • EGI: Gianfranco Sciacca

Action items

  • Item1
Topic attachments
  • EGI_Apr2014.pdf (159.8 K, 2014-05-14, GianfrancoSciacca): EGI availability report April 2014
  • SaltStack_Talk_at_PSI_-_F.Martinelli_-_7_May_20141.pdf (4704.6 K, 2014-05-12, FabioMartinelli): SaltStack Talk at PSI - F.Martinelli - 7 May 2014