
Atlas technical discussion meeting on 2011-04-07

Facts and problem definition

  • LHCb was experiencing random timeouts in their software. This was due to excessive metadata operations on the scratch filesystem.
  • A CSCS investigation of scratch FS usage showed that this was caused mainly by erratic user usage patterns, and also by Atlas production/pilot jobs performing an excessive amount of background IO operations.
  • Atlas has a different memory usage pattern compared to the rest of the VOs. Some of their jobs use 3 or even 4 GB of RAM, relying on swap to do the work.
  • If a node starts swapping, its performance degrades, affecting the other VOs.
  • Partitioning the WNs is harmful for everyone, since free resources in one partition cannot be used by the others, and vice versa.
  • Stability is CSCS's priority.

Actions taken:

  • On March 29th, CSCS confined Atlas to the new PhaseD nodes (2 kHS06), using GPFS.
  • Two days later, this allocation was moved to 30 SunBlades (mounted with GPFS) in order to provide 3.5 kHS06 instead of 2.

So we have two problems that it would be nice to solve as soon as possible.

Scratch FS problem

  • GPFS has only recently been introduced in Phoenix, and we don't know how much stress it can handle. We have assigned Atlas's 30 nodes to use it. It uses 60 disks, compared to Lustre's 400.
  • We could try to "force" it if needed, as long as it does not affect any other VO.
  • We are working on a script to ban users that run too many find/du/ls operations (see the sketch after this list), but:
    • It's not a solution. This needs to be fixed in the source.
    • It may not be possible for Atlas, since their jobs are not identified per user (pilots and ARC).
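
As a rough illustration of the kind of per-user check being discussed, the Python sketch below counts running find/du/ls processes per user and reports heavy users. The command list, the threshold, and the report-only behaviour are assumptions made for illustration; this is not the actual CSCS script, and a real version would still face the pilot/ARC identification problem noted above.

    #!/usr/bin/env python
    """Sketch of a per-user metadata-abuse check (illustrative only)."""
    import subprocess
    from collections import Counter

    METADATA_CMDS = {"find", "du", "ls"}  # commands treated as metadata-heavy
    THRESHOLD = 20                        # illustrative per-user process limit

    def count_metadata_procs():
        """Count running find/du/ls processes per user from ps output."""
        out = subprocess.check_output(["ps", "-eo", "user,comm"], text=True)
        counts = Counter()
        for line in out.splitlines()[1:]:      # skip the ps header line
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[1].strip() in METADATA_CMDS:
                counts[parts[0]] += 1
        return counts

    if __name__ == "__main__":
        for user, n in count_metadata_procs().items():
            if n > THRESHOLD:
                # A real script would throttle or ban here; this sketch only
                # reports, since pilot/ARC jobs may not map cleanly to a user.
                print("user %s has %d metadata-heavy processes" % (user, n))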

Swap problem

  • Mounting an external (Lustre/GPFS) file as a loopback device does not work as expected; it results in IO errors (see the sketch after this list).
  • We could try to use NFS instead. We are preparing one Thumper to serve that space, which will be mounted only by the Atlas nodes. This is not considered completely safe, nor a valid solution for 96 nodes.
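
Since this is the swap problem, the loopback file is presumably being used as a swap device. The Python sketch below shows roughly how such a setup would be built; the file path, the 4 GB size, and the use of dd/losetup/mkswap/swapon are assumptions for illustration, not the exact configuration tried on Phoenix. This is the kind of setup that produced the IO errors mentioned above.

    #!/usr/bin/env python
    """Sketch of a loopback swap file on a shared filesystem (illustrative)."""
    import subprocess

    # Hypothetical path on the shared (Lustre/GPFS/NFS) filesystem.
    SWAP_FILE = "/gpfs/scratch/swap/wn042.swap"
    SIZE_MB = 4096  # 4 GB, matching the high-memory job profile

    def run(cmd):
        """Run a command, echoing it first, and return its stripped output."""
        print("+ " + " ".join(cmd))
        return subprocess.check_output(cmd, text=True).strip()

    if __name__ == "__main__":
        # Preallocate the backing file on the shared filesystem.
        run(["dd", "if=/dev/zero", "of=" + SWAP_FILE,
             "bs=1M", "count=%d" % SIZE_MB])
        # Attach it to the first free loop device and capture the device name.
        loop_dev = run(["losetup", "-f", "--show", SWAP_FILE])
        # Format and enable it as swap (requires root).
        run(["mkswap", loop_dev])
        run(["swapon", loop_dev])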

To split or not to split

  • Both proposed solutions are considered risky. If we are going to apply them, splitting is the only way to protect the other VOs from problems.
  • Splitting is not desirable. It should last only until a software solution is found, but that could take months.
  • If we don't split, we put the other VOs at risk.

Attendees

  • Derek, Leo, Fabio, Gianfranco, Szymon, Sigve, Roland, Pablo

Minutes

  • We agreed that splitting was the worst-case scenario, so we should do our best to avoid it.
  • The idea, then, is to isolate Atlas as much as possible on the GPFS nodes, with no hard boundaries, just node preferences. If we see further problems, we will have to discuss this again.
  • The solution for the high-memory jobs is to create a high-memory queue for Atlas (see the sketch after this list).
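
As a sketch of what creating such a queue could look like, assuming a Torque/PBS-style batch system (the batch system is not named in these minutes): only the queue name and the 4 GB default memory follow the action item below; the ACL and other settings are illustrative assumptions.

    #!/usr/bin/env python
    """Sketch of creating an Atlas high-memory queue via qmgr (assumed Torque/PBS)."""
    import subprocess

    QUEUE = "atlas-himem"

    # qmgr directives; only the queue name and the 4 GB default memory come
    # from the action item, the rest is an illustrative guess.
    QMGR_COMMANDS = [
        "create queue %s" % QUEUE,
        "set queue %s queue_type = Execution" % QUEUE,
        "set queue %s resources_default.mem = 4gb" % QUEUE,
        "set queue %s acl_group_enable = True" % QUEUE,  # restrict to Atlas
        "set queue %s acl_groups = atlas" % QUEUE,
        "set queue %s enabled = True" % QUEUE,
        "set queue %s started = True" % QUEUE,
    ]

    if __name__ == "__main__":
        for cmd in QMGR_COMMANDS:
            subprocess.check_call(["qmgr", "-c", cmd])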

Action items

  • CSCS will remove the "cage" and, if possible at all, implement a "gravitation" formula so that Atlas jobs fall onto the GPFS nodes (and the other VOs' jobs fall away from them); see the sketch after this list.
  • CSCS will create an atlas-himem queue with 4 GB of default memory.
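
The "gravitation" idea can be pictured as a soft node-scoring rule: Atlas jobs are pulled towards the GPFS nodes and the other VOs' jobs are pushed away from them, but no node is ever excluded outright, so free resources stay usable by everyone. The Python sketch below is purely illustrative; the node names, base score, and bonus value are assumptions, and this is not actual scheduler configuration.

    #!/usr/bin/env python
    """Illustrative node-scoring sketch for the 'gravitation' preference."""

    # Hypothetical node inventory: name -> filesystem backing its scratch space.
    NODES = {
        "wn001": "lustre",
        "wn002": "lustre",
        "pd001": "gpfs",   # PhaseD / SunBlade nodes mounted with GPFS
        "pd002": "gpfs",
    }

    GPFS_BONUS = 100  # soft preference, not a hard boundary

    def node_score(node, vo):
        """Higher score = more preferred node for this VO's jobs.

        Atlas is attracted to GPFS nodes, every other VO is attracted to the
        non-GPFS nodes, and every node keeps a non-zero base score so it is
        never excluded outright.
        """
        base = 10
        on_gpfs = NODES[node] == "gpfs"
        if vo == "atlas":
            return base + (GPFS_BONUS if on_gpfs else 0)
        return base + (0 if on_gpfs else GPFS_BONUS)

    if __name__ == "__main__":
        for vo in ("atlas", "lhcb"):
            ranked = sorted(NODES, key=lambda n: node_score(n, vo), reverse=True)
            print(vo, "prefers:", ranked)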