
Swiss WLCG Operations Meeting on 2010-08-12

Date and time: 2010/08/12 at 9:30
Place: EVO, password: chipp
External link / EVO: http://evo.caltech.edu/evoNext/koala.jnlp?meeting=vsvivIeieeIMI9a8aDItas

Agenda

Report on unscheduled downtime (FG)
Discussion about Experiment Software Area
- ExperimentSofwareAreaProposal
Review Action Items
- CMS has to enable SAM tests for CreamCE
- ATLAS has to check how CreamCE behaves and also enable SAM tests
AOB

Attendants

ATLAS: Gianfranco Sciacca, Marc Goulette, Sigve Haug, Szymon Gadomski
CMS: Derek Feichtinger
LHCb: Roland Bernet
CSCS: Fotis Georgatos, Peter Oettl

Minutes

Report on unscheduled downtime (FG)
- Troublesome situation due to various Lustre instabilities
- complexity/size of experiment-software aggravates Lustre risks
- VO reps realized the issue and asked what we can do about it
- CSCS has placed purchase orders for new controller hardware
- CSCS recommends verifying AND rethinking the exp-software dirs
- DF:
  - probably longest emergency downtime we ever experienced
  - VO contacts were not aware that there are 4-5 Lustre failovers / month
  - if only the Lustre shared scratch were affected, we could just wipe and rebuild it. All running jobs would be lost, but the downtime would only be some hours. Rebuilding the SW area can take days and also involves work by central operations people. So we should separate the exp SW from the scratch. (Also, Lustre is not ideal for storing huge amounts of tiny files as found in the exp SW area.)
  - many sites had similar experiences; they went back to NFS
  - CSCS management (MDL and/or DU) has to push on Sun. The system is severely affecting our operations and consuming excessive admin time to keep it stable.
    - Hardware is troublesome
    - Adequate support is not delivered
- SH:
  - Lustre at Tier-3 since April
  - Experiment software remained on NFS
  - MDS crashes (no failover node)
- See also ticket #7851

Discussion about Experiment Software Area
- In short: go back to PhaseB implementation; DRBD is well tested
- Proposal: start from scratch so we have a known state and a clean reduced software area
  - VOs agree
  - SH: clarify with Andreij if ARC could use gLite software area
- VOs asked for more than 1 TB of total disk space
- Offered solution:
  - Set up CE + WN to start software installation
  - no interruption needed; switch software area from Lustre to NFS after installation is finished

Review Action Items:
- VO Reps will check with their contacts what is possible to test
- RB: LHCb is running fine on CREAM

AOB
- SH: many sites in CH use Lustre; would be useful to gather experiences/knowledge
  - PO: HPC Forum about Parallel File Systems in October

Action items

CSCS: purchase hardware needed for implementing NFS setup
CSCS: open 3 tickets against Sun support; see ticket #7851
MG: check with VO to test CREAM CE and give status report; check availability of SAM tests for CREAM-CE
DF: check availability of SAM tests for CREAM-CE


Topic revision: r4 - 2010-08-13 - DerekFeichtinger
