Swiss Grid Operations Meeting on 2019-04-04 at 14:00
Site status
CSCS
Storage
- dCache
- New (CSCS) storage system (4x Huawei OceanStor 6800 V5, aggregate write bandwidth ~40 GB/s, 18.1 PB) has been delivered and configured (will also be used for dCache)
- Installing/configuring new 'se' nodes to start migrating data
- Will then decommission old servers and move the ones still under warranty to the new island
- Spectrum Scale
- Finalizing the plan to:
- Move from InfiniBand (IB) to 25G (the card on each server needs replacing) and re-configure GPFS
- Move 16 servers from old island to new island
- Move (SSD) storage from old island to new island
- Attach new expansion units to the sc9000 controllers (SSD only for now) and move the GPFS "slow tier" from the DDN SFA12k to the sc9000
- Will need 2 days of maintenance (but we will do our best to make it in 1 day). The maintenance will be announced as soon as everything is planned. Target dates are next week or the week after
Compute services
We've recently identified a number of jobs stalling or failing due to timeouts, where the GPFS filesystem as seen on the compute nodes (via DVS) was very slow, at times unusable. When this was happening, the filesystem would typically stall with mmfsd running at 400% CPU time and DVS kernel threads piling up to about 1000, effectively making the load flatline at about 1070, which is the maximum allowed in the code of the DVS kernel module.
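A node-level check for this symptom might look like the following sketch. The "dvs" thread-name prefix and the use of plain ps/procfs are assumptions for illustration, not a supported diagnostic:

```shell
# Illustrative node check (names and patterns assumed): show the 1-minute
# load, the number of kernel threads whose name starts with "dvs", and the
# CPU usage of the GPFS daemon mmfsd (prints "n/a" when it is not running).
echo "load(1m):    $(cut -d' ' -f1 /proc/loadavg)"
echo "dvs threads: $(ps -e -o comm= | grep -c '^dvs')"
echo "mmfsd cpu%:  $(ps -o %cpu= -C mmfsd 2>/dev/null || echo n/a)"
```

On an affected node, the load line hovering near 1070 together with a dvs thread count around 1000 would match the behaviour described above.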
While debugging this, we've identified and fixed some problems:
- ATLAS jobs have been constantly hitting the same set of files in the ARC cache.
- This has been now fixed by copying files from the cache to the session dir.
- Analysis jobs were disabled for a few days and have now been re-enabled.
- Some LHCb jobs were stalling due to arc04 producing wrong BDII information.
- Fixed by re-introducing a tuning to ARC.
- Certain CMS jobs have been generating an unusually high number of files in very short periods of time (example: 550,000 small files in 4 minutes).
- We've moved CMS jobs, which are mostly analysis, to an alternative scratch filesystem (Sonexion 1600, 2.7 PB Lustre) that is barely coping with this type of workload (20 MB/s, 30,000 IOPS).
- We are evaluating DataWarp (again) with independent allocations per job, as well as other solutions.
- We contacted the user we believe is the main driver of this workload, but it has become clear that we cannot easily ban a specific user from the system.
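A small-file burst like the one described above can be spotted with a simple find-based check; the directory and the thresholds below (1 MiB, 5 minutes) are made up for illustration:

```shell
# Count files smaller than 1 MiB modified in the last 5 minutes under a
# scratch directory (path and thresholds are hypothetical examples).
dir=/tmp/scratch-demo
rm -rf "$dir" && mkdir -p "$dir"
for i in 1 2 3; do echo x > "$dir/f$i"; done   # simulate a few small files
find "$dir" -type f -size -1048576c -mmin -5 | wc -l   # → 3
```

Note the byte-exact `-size -1048576c`: with `-size -1M`, find rounds file sizes up to whole MiB units, so even tiny files would fail a "less than 1M" test.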
About 10 people from Cray and CSCS have been involved in this process, reaching the point where we were digging into the code of DVS itself to see what was going on.
As a result, we introduced a few minor changes aimed at simplifying and improving the performance of the GPFS filesystem exposed to the compute nodes via DVS, and we are looking at ways to be more robust when such situations happen.
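The ATLAS cache fix mentioned above (copying files from the ARC cache into the per-job session directory, so many jobs stop reading the same shared files) can be sketched as follows; all paths are hypothetical:

```shell
# Hypothetical illustration of the fix: give each job a private copy of its
# cached input instead of having it read the shared ARC cache file directly.
cache=/tmp/arc-cache-demo
session=/tmp/arc-session-demo
mkdir -p "$cache" "$session"
echo data > "$cache/input.root"     # stand-in for a cached input file
cp "$cache/input.root" "$session/"  # per-job copy: no shared hotspot
ls "$session"                       # → input.root
```

The trade-off is extra copies on scratch in exchange for removing the contention on a single set of cache files.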
PSI
UNIBE-LHEP
- Xxx
- Accounting numbers (from scheduler) from last month
UNIBE-ID
UNIGE
- Xxx
- Accounting numbers (from scheduler) from last month
NGI_CH
- Xxx
- NGI-CH Open Tickets review
Other topics
Next meeting date:
A.O.B.
Attendants
- CSCS:
- CMS:
- ATLAS:
- LHCb:
- EGI:
Action items