Swiss Grid Operations Meeting on 2014-06-05

Site status

CSCS

  • Maintenance report:
    • So far the private /tmp seems to be working without issue
    • During the mounting of the replacement switch we had issues with the switch not starting up correctly. The root cause is unknown; the switch is now back up, but 4 WNs were down for an extended period.
    • Had issues with multiple sites listed in the dCache xrootd plugin due to incorrect syntax; monitoring is currently split between ATLAS and CMS
    • All production NFS mounts are now provided by CSCS NAS
    • Everything else went well.
  • Other:
    • Have given access to a DIRAC developer to help improve the SLURM integration
    • Starting an engagement with the APEL team on improvements; if interested, the following wiki page will be updated over time: https://wiki.egi.eu/wiki/APEL_Batch_Support
    • Shipped WNs and an unmanaged IB switch to PSI
    • Saw the following on the dCache mailing list regarding useful scripts; may be something to keep an eye on: https://github.com/dCache/scripts
    • Working on moving backup from CSCS /store file system to NAS

PSI

  • Installing 10 SL6 UIs (my top priority these days):
    • We received ten 2011-vintage AMD servers from CSCS; thank you again, colleagues.
    • To install these servers I need to make room in our rack, unmount the old HW, mount the new HW, redo the cabling, ... the usual time-consuming tasks.
    • Last year we received plenty of 3.5" SATA 1TB disks from CSCS, so I'll set up the 10 UIs with mdadm RAID10, which delivers the best performance; I can choose among 3 possible layouts: NEAR, FAR, and OFFSET. So far OFFSET seems to be my choice (see the sketch at the end of this section).
    • About the filesystems: EXT4 for /, /var, /opt, ... and XFS for /scratch; XFS automatically detects the underlying mdadm geometry.
    • Because the 10 UIs are exposed to the Internet, I'll try to follow the CIS guidelines for RHEL6.
  • Preparing the T3 for my two Summer breaks:
    • Summer => 'don't touch the configurations'
    • Writing new documentation and updating the old, reviewing each Nagios check, the Puppet recipes, the cron jobs, etc.; it's a lot of work.
    • Derek and Daniel will take care of the T3 during my absences.
  • dCache 2.6
    • Enabled CMS Xrootd monitoring; asked CSCS to do the same via Ticket #15997
    • Asked users to delete old data because we were close to 95% usage => they deleted 70 TB!
    • Slowly testing the upgrade to 2.6.28
    • Slowly studying WebDAV because of the WLCG HTTP federations.
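
A minimal sketch of the disk layout intended for the UIs, assuming four of the 1 TB SATA disks per server; the device names, chunk size and mount points below are illustrative, not the final PSI configuration:

      # RAID10 with the OFFSET layout ("o2"); devices and chunk size are assumptions
      mdadm --create /dev/md0 --level=10 --layout=o2 --chunk=512 \
            --raid-devices=4 /dev/sd[abcd]1
      # mkfs.xfs reads the md geometry and sets sunit/swidth automatically
      mkfs.xfs /dev/md0
      mount /dev/md0 /scratch
      # EXT4 for /, /var and /opt would go on separate md arrays or partitions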

UNIBE-LHEP

  • Production operations
    • Lustre MDS glitch on Fri 16th May on the ce01 cluster caused the cluster to hang for some hours. Once detected, recovery took ~1.5 h: kill all jobs, unmount all clients, stop all OSTs. Stopping the MDT (unmount) would not work, so the MDS had to be power-cycled. After restarting the MDT and all OSTs and remounting on all clients, it slowly came back together and operations could be resumed.
    • a-rex crashed on ce02 on 19th May. It was detected early and restarted (symptom: the resource disappears from the GIIS).
    • The bdii service crashed on the site-bdii but left the slapd process running, which prevented the restart cron from fixing it within the 15-minute window (a hedged sketch of a more robust check follows after this list).
    • The xrootd service on the DPM was broken for some weeks: xrootd segfaults upon start. Followed advice from DPM experts to upgrade to the latest version (1.8.8, was 1.8.7), which fixed the issue. FAX redirection might still need some additional fixing (ongoing); we suspect a misconfiguration on at least one disk server.
    • Reopened https://ggus.eu/index.php?mode=ticket_info&ticket_id=104765 (bogus Nagios probe failing in April, causing 33% availability for the site)
  • ATLAS specific operations
  • Accounting
    • Confusion created by changes in the APEL brokers: there is no longer a separate test and prod broker; there is only a prod network broker (new hostnames) on which a test queue is active as well: https://ggus.eu/index.php?mode=ticket_info&ticket_id=105610
    • Tried to trigger Jura manually and it seems to connect to the prod network just fine. However, the job records initially created by a-rex include the hostname of the APEL endpoint, so all records with the wrong settings must be changed by hand (see the clean-up sketch after this list).
    • Still, a-rex is NOT triggering Jura automatically. All job records must be cleaned up first before resuming the investigation of this. The backlog is becoming quite serious now (>700k records).
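
A hedged sketch of the kind of restart check that would have handled the stale slapd left behind by the bdii crash; the process names, the service script and the ldap user are the standard EMI/BDII ones, but should be verified locally before use:

      #!/bin/bash
      # Illustrative cron check, not the actual site cron.
      # If the bdii updater is gone but slapd was left running, a plain restart
      # cannot recover, so drop the orphaned slapd first.
      if ! pgrep -f bdii-update >/dev/null; then
          pkill -u ldap slapd 2>/dev/null
          sleep 5
          service bdii restart
      fi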
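
On the job records carrying the old APEL endpoint, a heavily hedged clean-up sketch; the control directory path, the record location and both hostnames are placeholders that must be checked against the local A-REX configuration before anything is rewritten:

      # All names below are assumptions, not the actual UNIBE-LHEP values
      CTRLDIR=/var/spool/nordugrid/jobstatus        # assumed A-REX control directory
      OLD='old-broker.example.org'                  # placeholder for the retired endpoint
      NEW='new-broker.example.org'                  # placeholder for the new prod broker
      cp -a "$CTRLDIR/logs" "$CTRLDIR/logs.bak.$(date +%F)"   # back up first
      grep -rl "$OLD" "$CTRLDIR/logs" | xargs -r sed -i "s|$OLD|$NEW|g"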

UNIBE-ID

  • Mainly stable operations
  • ARC CE accounting
    • switched to JURA->APEL-prod and resent old records (2014-05-06 - present)
    • accounting now works as expected for us
    • the issue with old logs from 2014-04-10 to 2014-05-05 (pre-APEL era, with the missing qacct issue) is still open
  • Procurement
    • Finally received the 4 x3550 M4 to replace old hardware (2x frontend server, gridengine master, ...)
    • Received 7 new WNs (HS23) to build up new long.q and short.q -> more cores for mpi.q
  • Question to all RHEL-6 users: how much do you pay per license for "RHEL Server for HPC Compute Node, with Smart Management"?

UNIGE

  • Migration from SLC5 to SLC6 finished for batch
    • three login machines will keep running SLC5 until nobody needs to compile against old software releases
  • DPM upgrade to 1.8.8
    • head node and 15 disk servers
  • New CE running NorduGrid ARC 4.1 in production
    • an issue with a drifting clock in the VM disappeared after a restart of the VM
  • Started procurement for the 2014 upgrade. The plan is:
    • two hosts to run critical services in virtual machines (IBM 3550 M4)
    • two disk servers for new 'user' and 'software' space (Solaris, ZFS, NFS) (Sun Server X4-2)
    • one disk server for bulk data storage (Linux, NFS) (IBM 3630 M4)
    • NEW: double the CPUs and memory in two IBM x3755; they will have 64 cores and 192 GB each
  • Cleanup of the LOCALGROUPDISK space token of the SE, freed 87 of 436 TB (20%)
    • all data is in datasets, we work with flat lists of datasets
    • asking people to list what they need is enough
  • Maintenance issues
    • voms-proxy-info in EMI3 is written in Java and needs >2 GB of RAM; it crashes in batch jobs => use arcproxy instead (see the note after this list)
    • ssh config needed to have svn from CERN working without a password on SLC6 (a hedged example follows after this list)
    • IBM service took one month to come; dysfunctional communication
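
A small usage note on the proxy workaround; these are the standard flags for the two clients, but worth double-checking on the local UI:

      # EMI3 Java client reported above to need >2 GB of RAM in batch jobs
      voms-proxy-info --timeleft
      # lightweight C++ ARC client used instead; prints proxy and VOMS AC info
      arcproxy -I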
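
On the ssh config item: a hedged example of the kind of ~/.ssh/config stanza typically used so that svn+ssh against CERN authenticates via Kerberos/GSSAPI instead of asking for a password; the exact options used at UNIGE are not recorded in these minutes:

      # requires a valid Kerberos ticket (kinit) on the SLC6 machine
      Host svn.cern.ch lxplus.cern.ch
          GSSAPIAuthentication yes
          GSSAPIDelegateCredentials yes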

NGI_CH

  • OMB
  • NGI_CH National tasks
    • Status of migration of National tasks from SWITCH (no funding is available to cover these):
      • VOMS, GIIS -> migrated to BERN
      • SGAS -> being retired, aim for end of May
      • ARGUS -> see discussion below
      • ROD shifts -> will continue with CSCS
      • MONITORING -> Open, to be discussed at CHIPP CB (check server itself, site support, advanced support/logs, system config/update)
      • OMB -> meetings, monthly reports covered by BERN; accounting check, site config/performance/GGUS tks, security: each site covers their own (mutual support via [operations] list)
      • MAILING LISTS -> covered by BERN, which will receive the lists; GS should be admin of all
      • GOCDB -> Each site to look after their entries. Extra admin role in BERN (to be verified, should be able to edit all sites).
      • DTEAM, OPS -> Max two members for ops: BERN + ??
      • EUGridPMA -> Open (read minutes of meetings, Ales to send pointer to info)
  • ARGUS deployment status
    • https://xgus.ggus.eu/ngi_ch/?mode=ticket_info&ticket_id=284
    • National ARGUS instance requested (deadline: end 2013). This would pull info from the CERN servers, and site instances would pull info from the National ARGUS (a hedged pap-admin sketch of such a link follows at the end of this section).
    • But:
      • dCache: no ARGUS support
      • ARC: global banning can be performed at the ARC Control Tower level (pioneered by Wuppertal, ongoing development).
      • DPM: as of 1.8.8, support for ARGUS is back.
      • Quoting Maarten Litmaath: "AFAIK global banning support still is not a hard requirement for anything, but that may change later this year: both EGI and WLCG would like to see it work at least for CEs."
      • Future support from SWITCH (Valery) purely on a best effort/emergency basis. If no-one else picks this up, it might as well be retired at some point (my speculation)
    • Two possibilities:
      • Leave all as it is now, name CSCS service as National Service
      • Team with the Germans (already negotiated) and use their National instance (this likely implies CSCS ARGUS re-configuration)
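
For reference, a hedged sketch of what pointing a site Argus at an upstream (national or German) PAP typically looks like with the pap-admin CLI; the alias, endpoint URL and DN are placeholders, not real NGI_CH values, and the policy ordering would have to be agreed first:

      pap-admin add-pap upstream https://argus-national.example.ch:8150/pap/services/ \
          "/DC=ch/DC=example/CN=argus-national.example.ch"
      pap-admin enable-pap upstream
      # evaluate the remote (upstream) policies before the local ones
      pap-admin set-paps-order upstream default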

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS: Fabio Martinelli, Daniel Meister
  • ATLAS: Gianfranco Sciacca
  • LHCb: Roland Bernet
  • EGI:

Action items

  • Item1