Atlas technical discussion Meeting on 2011-04-07
Facts, and problem definition.
- LHCb was seeing random timeouts in their software. These were due to excessive metadata operations on the scratch filesystem.
- The CSCS investigation of scratch FS usage showed that this was caused mainly by random usage patterns from users, and also by Atlas production/pilots performing an excessive amount of background I/O operations.
- Atlas has a different memory usage pattern compared to the rest of the VOs. Some of their jobs use 3 or even 4 GB of RAM, relying on swap to do the work.
- If a node starts swapping, its performance drops, which affects the other VOs running on it.
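As an illustration of the swapping concern above, here is a minimal sketch of how a worker node could be checked for swap occupancy. It is not something CSCS is known to run: the function names and the 10% threshold are hypothetical, and it assumes a Linux node where the text comes from /proc/meminfo. Swap occupancy is only a rough proxy, since a node can hold pages in swap without actively swapping.

```python
def swap_usage(meminfo_text):
    """Return (swap_total_kb, swap_used_kb) from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


def is_swapping(meminfo_text, threshold=0.10):
    """Hypothetical check: True if more than `threshold` of swap is in use."""
    total, used = swap_usage(meminfo_text)
    return total > 0 and used / total > threshold
```

On a worker node this could be called as is_swapping(open("/proc/meminfo").read()). To tell whether pages are actually being moved in and out, the pswpin and pswpout counters in /proc/vmstat would have to be sampled over time as well.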
Actions taken:
- On March 29th CSCS confined Atlas to the new PhaseD nodes, with 2k HS06, using GPFS. Two days later, this space was moved
So, we have two problems that we would like to solve as soon as possible. The best way may be to address each one separately, and then try to find a combined solution.
Scratch FS problem
Attendees
Minutes
Action items