Swiss Grid Operations Meeting on 2019-04-04 at 14:00

Site status

CSCS

Storage

  • dCache
    • New (CSCS) storage system (4x Huawei OceanStor 6800 V5, aggr. W/bw ~40 GB/s, 18.1 PB) has been delivered and configured (it will also be used for dCache)
    • Installing/configuring new 'se' nodes to start migrating data
    • Will then decommission old servers and move the ones still under warranty to the new island

  • Spectrum Scale
    • Finalizing the plan to:
      • Move from IB to 25G (need to replace the card on each server) and re-configure GPFS
      • Move 16 servers from old island to new island
      • Move (SSD) storage from old island to new island
      • Attach new exp. units to sc9000 controllers (doing only SSD now) and move GPFS "slow tier" from DDN SFA12k to sc9000
    • Will need 2 days of maintenance (but we will do our best to complete it in 1 day). The maintenance will be announced as soon as everything is planned. Target dates are next week or the week after

Compute services

We've recently identified a number of jobs stalling or failing due to timeouts because the GPFS filesystem, as seen on the compute nodes (via DVS), was very slow and at times unusable. When this happened, the filesystem would typically stall with mmfsd running at 400% CPU and DVS kernel threads piling up to about 1000, so the load flatlined at about 1070, which is the maximum allowed in the code of the DVS kernel module (see dvsload.png).
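
As a rough illustration of how the flatlining described above could be spotted automatically, here is a minimal sketch that parses a /proc/loadavg-style line and checks whether the load is pinned near the ceiling. The helper names, the 5% margin and the use of /proc/loadavg on the DVS nodes are assumptions for illustration; only the ~1070 ceiling comes from the notes above.

```python
from typing import NamedTuple

# Assumption: ceiling taken from the ~1070 value quoted in these minutes.
DVS_LOAD_CEILING = 1070.0


class LoadAvg(NamedTuple):
    one_min: float
    five_min: float
    fifteen_min: float


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the first three fields of a /proc/loadavg-style line."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected loadavg format: {text!r}")
    return LoadAvg(float(fields[0]), float(fields[1]), float(fields[2]))


def is_flatlining(load: LoadAvg,
                  ceiling: float = DVS_LOAD_CEILING,
                  margin: float = 0.05) -> bool:
    """Return True when the 1- and 5-minute loads are both within
    `margin` (a hypothetical tolerance) of the ceiling."""
    threshold = ceiling * (1.0 - margin)
    return load.one_min >= threshold and load.five_min >= threshold


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read the current load averages from the local node (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.readline())
```

On a node, `is_flatlining(read_loadavg())` could feed an alert; requiring both the 1- and 5-minute averages to be high is one possible way to ignore short spikes.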

While debugging this, we've identified and fixed some problems:

  • ATLAS jobs have been constantly hitting the same set of files in the ARC cache.
    • This has now been fixed by copying files from the cache to the session dir.
    • Analysis jobs were disabled for a few days and have now been re-enabled.
  • Some LHCb jobs were stalling due to arc04 producing wrong BDII information.
    • Fixed by re-introducing a tuning to ARC.
  • Certain CMS jobs have been generating an unusually high number of files in very short periods of time (example: 550,000 small files in 4 min).
    • We've moved CMS jobs, which are mostly analysis jobs, to an alternative scratch filesystem (Sonexion 1600, 2.7 PB Lustre) that is barely coping with this type of workload (20 MB/s, 30,000 IOPS).
      snx1600.png
      snx1600-2.png
      snx1600-3.png
    • We are evaluating DataWarp (again) with independent allocations per job, as well as other solutions.
    • We contacted the user that we believe is the main driver of this workload, but it is now clear that we cannot easily ban a specific user from the system.
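
To put the CMS figures above in perspective, the following back-of-the-envelope sketch converts them into a file creation rate and an average transfer size per I/O operation. It assumes 1 MB = 10^6 bytes and that the 20 MB/s and 30,000 IOPS figures refer to the same time window; the function name is hypothetical.

```python
def rate_per_second(count: float, seconds: float) -> float:
    """Average rate of `count` events over a window of `seconds`."""
    if seconds <= 0:
        raise ValueError("window length must be positive")
    return count / seconds


# Example from the minutes: 550,000 small files created in 4 minutes.
files_per_second = rate_per_second(550_000, 4 * 60)

# Sonexion 1600 workload from the minutes: 20 MB/s at 30,000 IOPS.
# Assumption: decimal megabytes (10**6 bytes).
avg_bytes_per_op = (20 * 10**6) / 30_000

print(f"file creation rate: {files_per_second:.0f} files/s")
print(f"average transfer size: {avg_bytes_per_op:.0f} bytes per operation")
```

Under these assumptions the workload creates roughly 2,300 files per second and moves well under 1 KB per operation on average, which suggests a metadata-bound and small-I/O-bound load rather than a bandwidth-bound one.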

About 10 people from Cray and CSCS have been involved in this process, to the point where we were digging into the code of DVS itself to see what was going on.

As a result of this, we introduced a few minor changes aimed at simplifying and improving the performance of the GPFS filesystem exposed to compute nodes via DVS, and are certainly looking at ways to be more robust when such situations happen.

PSI

UNIBE-LHEP

  • Xxx
  • Accounting numbers (from scheduler) from last month

UNIBE-ID

  • Xxx

UNIGE

  • Xxx
  • Accounting numbers (from scheduler) from last month

NGI_CH

  • Xxx
  • NGI-CH Open Tickets review

Other topics

  • Topic1
  • Topic2
Next meeting date:

A.O.B.

Attendants

  • CSCS:
  • CMS:
  • ATLAS:
  • LHCb:
  • EGI:

Action items

  • Item1
Topic attachments

  • dvsload.png (393.0 K, 2019-04-04 11:17, MiguelGila): DVS load flatlining at ~1000
  • snx1600-2.png (69.5 K, 2019-04-04 11:22, MiguelGila)
  • snx1600-3.png (64.1 K, 2019-04-04 11:22, MiguelGila)
  • snx1600.png (32.2 K, 2019-04-04 11:22, MiguelGila)
Topic revision: r3 - 2019-04-04 - MiguelGila
 