Second Steering Board Meeting

Information and Slides

  • The meeting takes place on Tue, Feb 22nd, 14-16h
  • Location: ETHZ, Building/Room No. LFW E 11 (room reserved from 14-17 h, has a projector; reservation number E156029)
  • Meeting slides

Introduction of our new systems engineer Fabio Martinelli

Fabio Martinelli joined our group on February 1st, and he will take charge of the Tier-3. He has already begun to introduce a number of systematic improvements at the hardware monitoring level.

To be discussed

  • shared home directories
    • Enforced user quotas: what is an acceptable size for user home directories? (we currently calculate with 100 TB)
    • Enforced physics group quotas for /shome?
  • SE
    • policy quotas on SE for users and phys groups
  • UI
    • scratch directory quotas? automatic cleaning (also for user interfaces)?
      • Derek: I think we should not have quotas on scratch; this would hurt users more than it helps. Cleaning of scratch on a weekly basis should be enforced (a minimal cleanup sketch follows this list). If users use scratch for semi-permanent storage, we should investigate why and try to find a better solution (an extra disk on some system?)
    • (from Urs) Distribution of user interfaces among institutes? The aim is to avoid resource conflicts (e.g. scratch overuse)
  • WN
    • (from Urs) Is debugging (possibly interactive) access to specific worker nodes required?
  • review the guest user policy: how many guest users can a physics group have, and for how long?
  • planning of HW resources (see below)
  • should the T3 be extended to also provide a CE (to increase usage)?
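
As a concrete illustration of the weekly scratch cleaning mentioned above, the following is a minimal sketch, assuming a /scratch mount point and a 7-day retention threshold; both values and the dry-run behaviour are assumptions for illustration, not agreed policy.

  #!/usr/bin/env python3
  # Hypothetical weekly /scratch cleanup sketch. Paths and the age
  # threshold are assumptions, not agreed Tier-3 policy.
  import os
  import sys
  import time

  SCRATCH = "/scratch"                  # assumed mount point on UIs/WNs
  MAX_AGE_DAYS = 7                      # assumed retention period
  DRY_RUN = "--apply" not in sys.argv   # only delete when explicitly asked

  cutoff = time.time() - MAX_AGE_DAYS * 86400

  for dirpath, dirnames, filenames in os.walk(SCRATCH, topdown=False):
      for name in filenames:
          path = os.path.join(dirpath, name)
          try:
              if os.lstat(path).st_mtime < cutoff:
                  print("would remove" if DRY_RUN else "removing", path)
                  if not DRY_RUN:
                      os.remove(path)
          except OSError:
              pass  # file vanished or is unreadable; skip it
      # prune directories that have become empty
      for name in dirnames:
          path = os.path.join(dirpath, name)
          try:
              if not DRY_RUN and not os.listdir(path):
                  os.rmdir(path)
          except OSError:
              pass

Such a script would typically run from a weekly cron job on each UI and worker node; without the --apply flag it only reports what it would delete.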

Hardware situation and possible extensions

The current feeling is that we have enough CPU resources, but we could benefit from more storage (ca. 100-150 TB more would probably be necessary).

Machines going out of warranty this year:

| Node type | Node name | Hardware | Warranty date |
| Admin node | | SUN X4150 | 2011-05-16 |
| NFS experiment software server, log server | t3nfs01 | SUN X4150 | 2011-05-16 |
| NFS home directory + VM server | t3fs06 | Thumper | 2011-02-14 |
| Home directory backup | t3fs05 | Thumper | 2011-02-14 |
| SE file servers | t3fs01-t3fs04 | Thumper | 2011-06-02 |
| Computing Element + frontier, mon | t3ce01 | SUN X4150 | 2011-05-16 |
| SE head node | t3se01 | SUN X4150 | 2011-05-16 |
| SE database | t3dcachedb01 | SUN X4150 | 2011-05-16 |
| User interfaces | t3ui01-04 | SUN X4150 | 2011-05-16 |
| Virtual machine hosts | t3vmmaster01, t3wn08 | SUN X4150 | 2011-05-16 |
| Old worker nodes | t3wn02-04 | SUN X4150 | 2011-05-16 |

  • We can use an older X4150 worker node as a source of spare parts for the other X4150 machines
  • We could take one Thumper offline as a source of replacement disks for failing disks in the other Thumpers. As a first measure, it would be good to buy a few replacement disks
  • Service nodes: We will try to put all non-IO intensive services onto the PSI virtualization infrastructure.

Possible upgrade of UI machines with more local disks

Mail from Mr. P. Eberhard from Oracle (2011-02-17): regarding disks for the X4150, the following are still available:

  • XRB-SS2CF146G10K-N
    • 146GB 10K RPM 2.5" SAS hard disk drive with Marlin bracket. RoHS-6. (x-option), 375.00 CHF
  • XRA-SS2CF300G10K-N
    • 300GB 10K RPM 2.5" SAS hard disk drive with Marlin bracket. (x-option) RoHS-6, 786.00 CHF

Initial mail from D. Feichtinger

Dear PSI-Tier3 Steering Board Members

We received a request from Urs Langenegger asking whether we would allow a second guest user for the B-physics group on our Tier-3. At our initial meeting we defined a policy that one guest user per physics group would be accepted (the policies are written down at https://wiki.chipp.ch/twiki/bin/view/CmsTier3/PhysicsGroupsOverview).


Current situation:
* We now have ca. 50 users (better numbers, taking inactive users into account, will follow)
* CPU resources are not tight; the queues are rarely contested these months
* SE space (ca. 200 TB shared between users and data sets) is tight. According to http://t3mon.psi.ch/addmon/sespace.txt we currently host 106 TB of user data and 84 TB of "official" data
* We do not automatically enforce the SE policies. Accounting also needs to be improved

In the short term, to answer Urs' request: should the additional guest user be accepted as a temporary exception (and should we set policy limitations)? We could discuss this in this mail thread or, if preferred, in a short phone conference.

In the longer term: we should meet early next year to talk about the development and operation of the system (new requirements, policies), now that we really have many active users. In February, a dedicated system administrator will start working at PSI; the T3 will be his main responsibility. He will be able to implement better resource accounting, etc. I think it would be ideal if we could set up the steering board meeting for mid-February (if there are no pressing reasons to do it earlier). If this sounds good to you, I will set up a Doodle poll.

Cheers,
Derek

Meeting minutes

Original minutes from Leonardo Sala with additions by DF:

  • The 100 GB quota per user on /shome has been approved. On special request a higher quota can be granted (steering board members will be informed about such users)
  • We agreed to allow 2 guest users per physics group
  • UIs
    • we will use old WNs as new UIs and distribute the users among the UIs. No hard separation, but by convention; the distribution will probably be based on institute affiliation, which should mitigate the issues of /scratch misuse.
    • No disk upgrade for the UIs is foreseen (too expensive): a mail warning about a full scratch filesystem should be implemented, but no automatic cleaning of /scratch on the UIs
  • SE
    • no hard quotas, but better accounting in order to keep track of (mis)usage. Proposal to use a DB backend to gather statistics and usage patterns (a minimal sketch follows these minutes)
  • HW upgrade:
    • Focus on the SE upgrade; raw estimate: 100-150 TB more, plus a major network improvement (InfiniBand or similar) to prepare for the WN upgrades that may come next year. The timescale depends on funding. The cheapest solution is probably to stay with NAS-based storage and dCache, but we will also consider systems based on a parallel FS + StoRM or similar.
    • No major WN upgrade, but prepare a backup plan for a quick small upgrade (~ a factor of two) in case of necessity during data taking. Timescale and funding to be evaluated
  • Implementation of an interactive queue in SGE for debugging purposes, without reserving a dedicated core, so that response remains fast even when the queues are full. No direct interactive login access to the WNs will be offered apart from this.
  • Every institute representative should prepare and send to Christoph a couple of slides listing the scientific papers/analyses prepared using the T3. This would explain the necessity of the T3 and its upgrade to funding agencies better than monitoring plots
  • Generally speaking, the board is satisfied with the current infrastructure and does not see bottlenecks for its usage.
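
To make the SE accounting proposal above more concrete, the following is a minimal sketch of gathering per-user and per-group usage into an SQLite backend. The input format (whitespace-separated "user group size_bytes path" lines, e.g. produced from a dCache namespace dump), the file names, and the table layout are all assumptions for illustration, not the agreed implementation.

  #!/usr/bin/env python3
  # Hypothetical SE accounting sketch: aggregate a storage dump into SQLite.
  # Input/output file names and the dump format are assumptions.
  import sqlite3
  import sys
  import time

  DB_FILE = "se_accounting.sqlite"   # hypothetical database location
  DUMP_FILE = sys.argv[1]            # e.g. "se_dump.txt"

  con = sqlite3.connect(DB_FILE)
  con.execute("""CREATE TABLE IF NOT EXISTS usage (
                     ts INTEGER, user TEXT, grp TEXT, bytes INTEGER)""")

  now = int(time.time())
  totals = {}                        # (user, group) -> total bytes
  with open(DUMP_FILE) as dump:
      for line in dump:
          try:
              user, grp, size, _path = line.split(None, 3)
              key = (user, grp)
              totals[key] = totals.get(key, 0) + int(size)
          except ValueError:
              continue               # skip malformed lines

  con.executemany("INSERT INTO usage VALUES (?, ?, ?, ?)",
                  [(now, u, g, b) for (u, g), b in totals.items()])
  con.commit()

  # Report the current snapshot, largest consumers first
  for user, grp, b in con.execute(
          "SELECT user, grp, bytes FROM usage WHERE ts = ? ORDER BY bytes DESC",
          (now,)):
      print(f"{user:15s} {grp:12s} {b / 1e12:8.2f} TB")
  con.close()

Run periodically (e.g. from cron), this keeps timestamped snapshots in one table, so usage trends per user or physics group can later be plotted or checked against the agreed policy limits.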

Topic attachments
| Attachment | Size | Date | Who | Comment |
| PSI-T3-Board-20110222.pdf | 277.4 K | 2011-02-24 | DerekFeichtinger | Meeting slides |
