Site Downtime Procedure
This page describes the actions required to place the site into downtime.
Announcement
An official announcement must be made at least 5 days before the site goes down.
Cream
To check the current state of a CREAM CE you need a valid certificate. Then run the following:
glite-ce-service-info cream01.lcg.cscs.ch
Interface Version = [2.1]
Service Version = [1.16.2 - EMI version: 3.6.0-1.el6]
Description = [CREAM 2]
Started at = [Fri Nov 8 17:53:18 2013]
Submission enabled = [NO]
Status = [RUNNING]
To disable submission, enter the following:
glite-ce-disable-submission cream01.lcg.cscs.ch
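The state check above can be scripted when several CEs need to be verified before and after disabling submission. The following is a minimal sketch, not an established site tool: the function names `submission_state` and `report_ces` are hypothetical, and the parsing assumes the output keeps the `Submission enabled = [...]` line format shown above.

```shell
#!/bin/sh
# Print the value of "Submission enabled" from glite-ce-service-info
# output read on stdin (assumes the line format shown on this page).
submission_state() {
    sed -n 's/^[[:space:]]*Submission enabled[[:space:]]*=[[:space:]]*\[\(.*\)\].*$/\1/p'
}

# Report the submission state of every CE given as an argument.
# Only cream01.lcg.cscs.ch is named on this page; pass your own host list.
report_ces() {
    for ce in "$@"; do
        state=$(glite-ce-service-info "$ce" 2>/dev/null | submission_state)
        echo "$ce: submission enabled = ${state:-UNKNOWN}"
    done
}
```

For example, `report_ces cream01.lcg.cscs.ch` prints one line for that CE. A valid certificate is still required, as for the manual check.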
The CREAM CEs use the following check to determine whether they should publish themselves as draining or in production. This file is managed by cfengine.
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
Arc
To disable submission in ARC, the 'allownew' option in arc.conf needs to be changed. This file is managed by cfengine, so make the following edit for each ARC CE.
vim /srv/cfengine/files/arc01/etc/arc.conf
#allownew=yes
allownew=no
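If the same edit has to be made for several ARC CEs, it can be applied non-interactively. This is a rough sketch under assumptions: the function name `disable_allownew` is hypothetical, and it only handles a line that reads exactly `allownew=yes` (optionally followed by whitespace). It comments that line out and adds `allownew=no` below it, mirroring the manual edit.

```shell
#!/bin/sh
# Comment out "allownew=yes" and add "allownew=no" directly below it.
# Usage: disable_allownew /path/to/arc.conf
disable_allownew() {
    conf=$1
    tmp=$(mktemp) || return 1
    # Write the edited copy to a temp file first, then overwrite the
    # original in place so its ownership and permissions are kept.
    awk '
        /^allownew=yes[ \t]*$/ { print "#" $0; print "allownew=no"; next }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

For the host named on this page the call would be `disable_allownew /srv/cfengine/files/arc01/etc/arc.conf`. Paths for any other ARC CE are assumed to follow the same layout and should be checked first.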
Slurm
Even with submission disabled, jobs may still find their way into the cluster, so we can also set the partitions to a draining state within Slurm.
To set a partition to drain, run the following for each VO.
scontrol update partitionname=lcgadmin state=drain
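Repeating the command for every VO partition can be wrapped in a small loop. The sketch below is illustrative only: `set_partitions_state` and the `DRY_RUN` variable are hypothetical names, and `lcgadmin` is the only partition named on this page, so the remaining partition names must be supplied by the operator.

```shell
#!/bin/sh
# Set each named Slurm partition to the given state.
# Usage: set_partitions_state <state> <partition>...
# With DRY_RUN=1 the scontrol commands are printed instead of executed.
set_partitions_state() {
    state=$1
    shift
    for part in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scontrol update partitionname=$part state=$state"
        else
            scontrol update partitionname="$part" state="$state"
        fi
    done
}
```

Running `DRY_RUN=1 set_partitions_state drain lcgadmin` first shows the commands that would be issued. The same function can be called with `up` after the downtime to reopen the partitions.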
Monitoring
If you use the "at" command to schedule an action such as changing the partition state, please use its mail functionality and write to a log to preserve historical data.
For example, to set a Slurm partition to drain:
echo 'scontrol update partitionname=lcgadmin state=drain && echo "Set slurm partition lcgadmin to drain" | logger -t AT' | at -m 7am + 5 days
TODO:
- Nagios checks for partition state
- Nagios checks for CREAM submission state
- Nagios checks for ARC submission state
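As a starting point for the first TODO item, a partition-state check could look roughly like the sketch below. It is not an existing plugin: the function names are hypothetical, and it assumes that `scontrol show partition` prints a `State=<VALUE>` field. The return codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Read "scontrol show partition <name>" output on stdin and print the
# value of its State= field (assumed format: State=UP, State=DRAIN, ...).
partition_state() {
    tr ' ' '\n' | sed -n 's/^State=\([A-Za-z]*\).*$/\1/p' | head -n 1
}

# Compare a partition's state with the expected one, Nagios style.
# Usage: check_partition_state <partition> <expected-state>
check_partition_state() {
    part=$1
    expected=$2
    actual=$(scontrol show partition "$part" 2>/dev/null | partition_state)
    if [ -z "$actual" ]; then
        echo "UNKNOWN: could not read state of partition $part"
        return 3
    elif [ "$actual" = "$expected" ]; then
        echo "OK: partition $part is $actual"
        return 0
    else
        echo "CRITICAL: partition $part is $actual, expected $expected"
        return 2
    fi
}
```

During a downtime the check would be called as `check_partition_state lcgadmin DRAIN`, and with `UP` in normal production.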
--
GeorgeBrown - 2013-11-20