KeyWords: SysAdmin, Torque, Maui
A script to schedule downtimes
The tedious part of a scheduled downtime is figuring out, and actually
typing, the correct at(1) incantations to drain the job queues, so that
the downtime morning starts with no jobs running in the cluster, yet
users can still run jobs during the weekend before the maintenance.
After some experimentation (see this blog post and CSCS ticket #5637),
we came to the conclusion that the best thing to do is:
- allow jobs in the ops queue (that is, SAM tests) even during the downtime: if the downtime has been properly scheduled in the GOCDB, they will not count against our reliability;
- stop all other queues so that no new jobs are started: a queue allowing X hours of CPU time should stop starting new jobs at least X+1 hours before the downtime begins;
- drain all queues (except, again, ops): a queue will not accept any new jobs (I state) from the point in time when new jobs risk having less than 30 minutes of proxy validity left at the downtime end. For the purpose of computing this, each job's proxy is assumed to last as long as the job's requested CPU time, with a minimum of 12 hours.
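The two timing rules above can be sketched in a few lines of Python. This is only an illustration, not the code of the script itself: the function names are made up, the CPU limits of 8, 24 and 48 hours are guessed from the queue names, and the draining interval is measured back from the start of the downtime. That last choice is an assumption; it is the reading that reproduces the drain times shown in the example output further down for egee8h, egee24h and egee48h.

```python
from datetime import datetime, timedelta

# Constants taken from the rules in the text above.
MIN_PROXY_LIFETIME = timedelta(hours=12)  # assumed minimum proxy lifetime
PROXY_MARGIN = timedelta(minutes=30)      # proxy validity that must remain
STOP_MARGIN = timedelta(hours=1)          # the "+1" in "X+1 hours"


def latest_stop_time(downtime_start, cpu_hours):
    """Latest moment at which a queue may still start new jobs (X+1 rule)."""
    return downtime_start - timedelta(hours=cpu_hours) - STOP_MARGIN


def drain_time(downtime_start, cpu_hours):
    """Moment at which a queue stops accepting new jobs.

    Assumption: the interval is counted back from the downtime start.
    """
    proxy = max(timedelta(hours=cpu_hours), MIN_PROXY_LIFETIME)
    return downtime_start - proxy - PROXY_MARGIN


if __name__ == "__main__":
    start = datetime(2009, 2, 2, 9, 0)
    # Hypothetical CPU limits, inferred from the queue names only.
    for queue, hours in [("egee8h", 8), ("egee24h", 24), ("egee48h", 48)]:
        print(queue,
              "drain at", drain_time(start, hours),
              "stop no later than", latest_stop_time(start, hours))
```

Note that latest_stop_time() only gives the upper bound implied by the "at least X+1 hours" rule; the closing times in the example output are earlier than that bound.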
I've written a Perl script, /opt/cscs/sbin/downtime, to compute the
queue closing and draining times and to submit the appropriate at(1)
jobs for controlling the Torque queues. (The script is deployed by
CfEngine and registered in its SubVersion repository.)
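This page does not show the commands that the script hands to at(1). As a rough sketch of what such a job could look like, the Python fragment below builds an at(1) invocation plus the qmgr line to feed it on standard input. It assumes that draining maps to the Torque queue attribute "enabled" and closing to "started"; the helper name at_job is hypothetical, and nothing is executed here.

```python
import re
from datetime import datetime


def at_job(queue, attribute, when):
    """Return (argv, stdin_text) for scheduling a qmgr call with at(1).

    attribute is "enabled" (queue accepts new jobs) or "started"
    (queued jobs may begin running); both are set to False.
    """
    if attribute not in ("enabled", "started"):
        raise ValueError("attribute must be 'enabled' or 'started'")
    if not re.fullmatch(r"[A-Za-z0-9_-]+", queue):
        raise ValueError("unexpected characters in queue name")
    # at -t takes a timestamp in [[CC]YY]MMDDhhmm form.
    argv = ["at", "-t", when.strftime("%Y%m%d%H%M")]
    job = 'qmgr -c "set queue %s %s = False"\n' % (queue, attribute)
    return argv, job


if __name__ == "__main__":
    argv, job = at_job("egee8h", "enabled", datetime(2009, 2, 1, 20, 30))
    print(" ".join(argv))
    print(job, end="")
```

To actually submit the job, the pair could be passed to subprocess.run(argv, input=job, text=True, check=True) by a user allowed to run qmgr.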
Example usage:
- Schedule a downtime at 9:00 on 2009-02-02
# downtime --verbose 2009-02-02
Downtime will start at: 09:00 on 2009-02-02
Downtime will end at: 17:00 on 2009-02-02
Draining 'egee8h' at 20:30 2009-02-01...
job 23 at 2009-02-01 20:30
Draining 'egee24h' at 08:30 2009-02-01...
job 24 at 2009-02-01 08:30
Draining 'others' at 08:30 2009-01-31...
job 25 at 2009-01-31 08:30
Draining 'egee48h' at 08:30 2009-01-31...
job 26 at 2009-01-31 08:30
Closing 'egee8h' at 22:29 2009-02-01...
job 27 at 2009-02-01 22:29
Closing 'egee1h' at 06:59 2009-02-02...
job 28 at 2009-02-02 06:59
Closing 'egee24h' at 02:29 2009-02-01...
job 29 at 2009-02-01 02:29
Closing 'others' at 20:29 2009-01-30...
job 30 at 2009-01-30 20:29
Closing 'egee48h' at 20:29 2009-01-30...
job 31 at 2009-01-30 20:29
- Schedule a downtime at 10:00 on 2009-02-02, lasting 4:00
# downtime 2009-02-02 10:00 --duration 4:00
Note that the downtime command must be run by a user who has
permission to operate on the Torque queues.