KeyWords: SysAdmin, Torque, Maui

A script to schedule downtimes

Scheduled downtimes have a tedious part: working out, and then actually typing, the correct at incantations to drain the job queues, so that the downtime morning starts with no jobs running in the cluster, yet users can still run jobs during the weekend before the maintenance.

After some experimentation (see this blog post and CSCS ticket #5637), we came to the conclusion that the best thing to do is:

  • allow jobs in the ops queue (that is, SAM tests) even during the downtime: if the downtime has been properly scheduled in the GOCDB, they will not count against our reliability;
  • stop all other queues so that no new jobs are started: a queue allowing X hours of CPU time should stop starting new jobs at least X+1 hours before the downtime begins;
  • drain all queues (except, again, ops): a queue will not accept any new jobs (I state) from the point in time when new jobs risk having less than 30 minutes of proxy validity left at the downtime end. For the purpose of computing this, each job's proxy is assumed to last as long as the job's requested CPU time, with a minimum of 12 hours.

I've written a Perl script, /opt/cscs/sbin/downtime, that computes the queue closing and draining times and submits the appropriate at jobs to control the Torque queues. (The script is deployed by CfEngine and kept in its Subversion repository.)
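
For illustration only, here is a minimal Python sketch of the kind of at incantation involved. It only builds and prints the shell lines; it does not run them. The helper name at_command is hypothetical, and the mapping of "draining" to the Torque queue attribute enabled and of "closing" to started is an assumption: the commands actually issued by the downtime script are not shown on this page.

```python
from datetime import datetime


def at_command(queue, attribute, when):
    """Build a shell line that sets a Torque queue attribute to False at `when`.

    `attribute` is expected to be "enabled" (queue accepts new jobs) or
    "started" (queued jobs may begin running). Which of the two the real
    script touches for each step is an assumption of this sketch.
    """
    qmgr = f'qmgr -c "set queue {queue} {attribute} = False"'
    # at -t expects a timestamp of the form [[CC]YY]MMDDhhmm
    stamp = when.strftime("%Y%m%d%H%M")
    return f"echo '{qmgr}' | at -t {stamp}"


# Times taken from the egee8h lines of the example run on this page.
print(at_command("egee8h", "enabled", datetime(2009, 2, 1, 20, 30)))
print(at_command("egee8h", "started", datetime(2009, 2, 1, 22, 29)))
```

Typing one such line per queue and per step, with the right date arithmetic each time, is exactly the tedium the script is meant to remove.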

Example usage:

  • Schedule a downtime at 9:00 on 2009-02-02
# downtime --verbose 2009-02-02
Downtime will start at: 09:00 on 2009-02-02
Downtime will end at: 17:00 on 2009-02-02
Draining 'egee8h' at 20:30 2009-02-01...
job 23 at 2009-02-01 20:30
Draining 'egee24h' at 08:30 2009-02-01...
job 24 at 2009-02-01 08:30
Draining 'others' at 08:30 2009-01-31...
job 25 at 2009-01-31 08:30
Draining 'egee48h' at 08:30 2009-01-31...
job 26 at 2009-01-31 08:30
Closing 'egee8h' at 22:29 2009-02-01...
job 27 at 2009-02-01 22:29
Closing 'egee1h' at 06:59 2009-02-02...
job 28 at 2009-02-02 06:59
Closing 'egee24h' at 02:29 2009-02-01...
job 29 at 2009-02-01 02:29
Closing 'others' at 20:29 2009-01-30...
job 30 at 2009-01-30 20:29
Closing 'egee48h' at 20:29 2009-01-30...
job 31 at 2009-01-30 20:29

  • Schedule a downtime at 10:00 on 2009-02-02, lasting 4:00
# downtime 2009-02-02 10:00 --duration 4:00

Note that the downtime command needs to be run by a user who has the permissions to operate on the Torque queues.
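
The timing rules listed earlier can be sketched in a few lines of Python (the real script is written in Perl and is not reproduced here). This is a sketch under stated assumptions, not the script's actual logic: closing_time uses the minimum margin of X+1 hours, which is less conservative than the "Closing" times printed in the example run; draining_time uses the downtime start minus max(X, 12 hours) minus 30 minutes, a rule inferred from the "Draining" times in that run rather than taken from the script. The function names and the per-queue CPU limits are hypothetical, the latter guessed from the queue names.

```python
from datetime import datetime, timedelta

MIN_PROXY_LIFETIME = timedelta(hours=12)  # minimum proxy lifetime assumed in the text
PROXY_MARGIN = timedelta(minutes=30)      # required proxy validity margin from the text


def closing_time(downtime_start, cpu_hours):
    """Latest time to stop starting jobs: X+1 hours before the downtime begins."""
    return downtime_start - timedelta(hours=cpu_hours + 1)


def draining_time(downtime_start, cpu_hours):
    """Time to stop accepting jobs (rule inferred from the example output)."""
    proxy_lifetime = max(timedelta(hours=cpu_hours), MIN_PROXY_LIFETIME)
    return downtime_start - proxy_lifetime - PROXY_MARGIN


start = datetime(2009, 2, 2, 9, 0)
# Hypothetical CPU-time limits, guessed from the queue names.
queue_cpu_hours = {"egee8h": 8, "egee24h": 24, "egee48h": 48}
for queue, hours in queue_cpu_hours.items():
    print(f"{queue}: stop starting jobs by {closing_time(start, hours):%H:%M %Y-%m-%d}, "
          f"stop accepting jobs at {draining_time(start, hours):%H:%M %Y-%m-%d}")
```

For a downtime starting at 09:00 on 2009-02-02, draining_time gives 20:30 on 2009-02-01 for egee8h, 08:30 on 2009-02-01 for egee24h and 08:30 on 2009-01-31 for egee48h, matching the "Draining" lines of the example run.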

Topic attachments

  • downtime (r1), 6.5 K, 2009-01-26 14:50, RiccardoMurri
Topic revision: r1 - 2009-01-26 - RiccardoMurri