Swiss WLCG Operations Meeting on 2010-08-12

Agenda

  • Report on unscheduled downtime (FG)
  • Discussion about Experiment Software Area
  • Review Action Items
    • CMS has to enable SAM tests for CreamCE
    • ATLAS has to check how CreamCE behaves and also enable SAM tests
  • AOB

Attendees

  • ATLAS: Gianfranco Sciacca, Marc Goulette, Sigve Haug, Szymon Gadomski
  • CMS: Derek Feichtinger
  • LHCb: Roland Bernet
  • CSCS: Fotis Georgatos, Peter Oettl

Minutes

  • Report on unscheduled downtime (FG)
    • Troublesome situation due to various Lustre instabilities
    • complexity/size of experiment-software aggravates Lustre risks
    • VO reps realized the issue and asked what we can do about it
    • CSCS has placed purchase orders for new controller hardware
    • CSCS recommends verifying and rethinking the exp-software dirs
    • DF:
      • probably longest emergency downtime we ever experienced
      • VO contacts were not aware that there are 4-5 Lustre failovers per month
      • If only the Lustre shared scratch were affected, we could just wipe and rebuild it. All running jobs would be lost, but the downtime would only be a few hours. Rebuilding the SW area can take days and also involves work by central operations people. So we should separate the exp SW from the scratch. (Also, Lustre is not ideal for storing huge amounts of tiny files, as found in the exp SW area.)
      • many sites had similar experiences; they went back to NFS
      • CSCS management (MDL and/or DU) has to put pressure on Sun. The system is severely affecting our operations and consuming excessive admin time to keep it stable.
        • Hardware is troublesome
        • Adequate support is not delivered
    • SH:
      • Lustre at Tier-3 since April
      • Experiment software remained on NFS
      • MDS crashes (no failover node)
    • See also ticket #7851

  • Discussion about Experiment Software Area
    • In short: go back to PhaseB implementation; DRBD is well tested
    • Proposal: start from scratch so we have a known state and a clean reduced software area
      • VOs agree
      • SH: clarify with Andreij if ARC could use gLite software area
    • VOs asked for more than 1 TB of total disk space
    • Offered solution:
      • Set up CE + WN to start software installation
      • No interruption needed; switch software area from Lustre to NFS after installation is finished

  • Review Action Items:
    • VO Reps will check with their contacts what is possible to test
    • RB: LHCb is running fine on CREAM

  • AOB
    • SH: many sites in CH use Lustre; would be useful to gather experiences/knowledge
      • PO: HPC Forum about Parallel File Systems in October

Action items

  • CSCS: purchase hardware needed for implementing NFS setup
  • CSCS: open 3 tickets against Sun support; see ticket #7851
  • MG: check with VO to test CREAM CE and give status report; check availability of SAM tests for CREAM-CE
  • DF: check availability of SAM tests for CREAM-CE
Topic revision: r5 - 2011-01-13 - PabloFernandez
 