<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

KeyWords: SysAdmin, [[Torque]], [[Maui]]

---+ A script to schedule downtimes

The tedious part of a scheduled downtime is working out, and then actually typing, the correct ==at== incantations to drain the job queues, so that the downtime morning starts with no jobs running in the cluster while users can still run jobs during the weekend before the maintenance.

After some experimentation (see [[https://twiki.cscs.ch/twiki/bin/view/LCGTier2/OldPhoenixBlog#Draining_CE_queues_for_the_sched this blog post]] and [[https://webrt.cscs.ch/Ticket/Display.html?id=5637 CSCS ticket #5637]]), we came to the conclusion that the best thing to do is:

   * allow jobs in the =ops= queue (that is, SAM tests) even during the downtime: if the downtime has been properly scheduled in the [[https://goc.gridops.org GOCDB]], they will not count against our reliability;
   * stop all other queues so that no new jobs are started: a queue allowing _X_ hours of CPU time should stop starting new jobs at least _X+1_ hours _before_ the downtime begins;
   * drain all queues (except, again, =ops=): a queue stops accepting new jobs (_draining_ state) at the point in time when new jobs risk having less than 30 minutes of proxy validity left at the downtime end. For the purpose of this computation, each job's proxy is assumed to last as long as the job's requested CPU time, with a minimum of 12 hours.

I've written a Perl script, =/opt/cscs/sbin/downtime=, that computes the queue closing and draining times and submits the appropriate ==at== jobs to control the [[Torque]] queues. (The script is being deployed by CfEngine and is registered in its SubVersion repository.)

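
To make the timing rules above concrete, here is a minimal Python sketch. It is not the Perl script itself, whose internals are not shown on this page, and every name in it (=closing_time=, =draining_time=, =at_command=) is hypothetical. The closing rule is the _X+1_ hours rule as stated. The draining rule is only one possible reading: stop accepting jobs 30 minutes plus one proxy lifetime before the downtime start. I picked that reading because, for a 09:00 start on 2009-02-02, it gives 20:30 on 2009-02-01 for an 8-hour queue, which agrees with the sample output in the example usage. The closing times in that output are earlier than the _X+1_ rule requires for most queues, so the real script presumably applies extra margin that is not reproduced here. The per-queue CPU limits are guessed from the queue names, and the =qmgr= / =at -t= command line is an assumption about how the queues could be controlled, not a copy of what the script submits.

```python
from datetime import datetime, timedelta

# Values taken from the text above.
MIN_PROXY_LIFETIME = timedelta(hours=12)  # a proxy is assumed to last at least 12 hours
PROXY_MARGIN = timedelta(minutes=30)      # validity that must still be left


def closing_time(downtime_start, cpu_hours):
    """Latest moment a queue allowing cpu_hours of CPU time may still start jobs.

    Implements the stated rule: stop at least X+1 hours before the downtime.
    """
    return downtime_start - timedelta(hours=cpu_hours + 1)


def draining_time(downtime_start, cpu_hours):
    """Moment a queue stops accepting new jobs.

    ASSUMPTION: one reading of the draining rule, chosen because it agrees
    with the sample output; the real script may compute this differently.
    """
    proxy_lifetime = max(timedelta(hours=cpu_hours), MIN_PROXY_LIFETIME)
    return downtime_start - proxy_lifetime - PROXY_MARGIN


def at_command(when, queue, attribute):
    """Build (but do not run) a shell line scheduling a qmgr call with at.

    ASSUMPTION: 'enabled = false' is used for draining and 'started = false'
    for closing; the real script's commands are not shown in the post.
    """
    qmgr = f'qmgr -c "set queue {queue} {attribute} = false"'
    return f"echo '{qmgr}' | at -t {when:%Y%m%d%H%M}"


if __name__ == "__main__":
    start = datetime(2009, 2, 2, 9, 0)
    # ASSUMPTION: CPU limits in hours, inferred from the queue names.
    queues = {"egee1h": 1, "egee8h": 8, "egee24h": 24, "egee48h": 48}
    for name, hours in queues.items():
        print(at_command(draining_time(start, hours), name, "enabled"))
        print(at_command(closing_time(start, hours), name, "started"))
```

Running the sketch only prints the command lines; nothing is submitted to ==at== or to Torque, so it is safe to try on any machine.
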
Example usage:

   * Schedule a downtime at 9:00 on 2009-02-02:
<verbatim>
# downtime --verbose 2009-02-02
Downtime will start at: 09:00 on 2009-02-02
Downtime will end at: 17:00 on 2009-02-02
Draining 'egee8h' at 20:30 2009-02-01... job 23 at 2009-02-01 20:30
Draining 'egee24h' at 08:30 2009-02-01... job 24 at 2009-02-01 08:30
Draining 'others' at 08:30 2009-01-31... job 25 at 2009-01-31 08:30
Draining 'egee48h' at 08:30 2009-01-31... job 26 at 2009-01-31 08:30
Closing 'egee8h' at 22:29 2009-02-01... job 27 at 2009-02-01 22:29
Closing 'egee1h' at 06:59 2009-02-02... job 28 at 2009-02-02 06:59
Closing 'egee24h' at 02:29 2009-02-01... job 29 at 2009-02-01 02:29
Closing 'others' at 20:29 2009-01-30... job 30 at 2009-01-30 20:29
Closing 'egee48h' at 20:29 2009-01-30... job 31 at 2009-01-30 20:29
</verbatim>

   * Schedule a downtime at 10:00 on 2009-02-02, lasting 4:00:
<verbatim>
# downtime 2009-02-02 10:00 --duration 4:00
</verbatim>

Note that the ==downtime== command must be run by a user who has permission to operate on the [[Torque]] queues.

---++ Readers' comments

%COMMENT{type="below"}%

Topic attachments:

| *Attachment* | *History* | *Size* | *Date* | *Who* |
| downtime | r1 | 6.5 K | 2009-01-26 - 14:50 | RiccardoMurri |

Topic revision: r1 - 2009-01-26 - RiccardoMurri
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.