Site Downtime Procedure
This page describes the actions required to place the site into downtime.
Change management
Make a note on http://neggio.cscs.ch/forum/ about the change.
Announcement
An official announcement must be made at least 5 days before the site goes down.
Cream
To check the current state of a Cream you need a valid certificate. Then execute the following:
glite-ce-service-info cream01.lcg.cscs.ch
Interface Version = [2.1]
Service Version = [1.16.2 - EMI version: 3.6.0-1.el6]
Description = [CREAM 2]
Started at = [Fri Nov 8 17:53:18 2013]
Submission enabled = [NO]
Status = [RUNNING]
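When several Creams have to be checked, the relevant field can be pulled out of this output with a small filter. The following is a minimal sketch, assuming the output keeps the "Name = [value]" layout shown above. The function name submission_enabled is hypothetical and is not part of the gLite tools.

```shell
#!/bin/sh
# Hypothetical helper: read glite-ce-service-info output on stdin and
# print the value of the "Submission enabled" field (e.g. YES or NO).
# Assumes the "Name = [value]" layout shown in the sample output.
submission_enabled() {
    sed -n 's/^[[:space:]]*Submission enabled[[:space:]]*=[[:space:]]*\[\(.*\)\][[:space:]]*$/\1/p'
}

# Usage (requires a valid certificate):
#   glite-ce-service-info cream01.lcg.cscs.ch | submission_enabled
```

If the field is missing from the output, the function prints nothing, which a calling script can treat as "state unknown".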
To disable submission, enter the following:
glite-ce-disable-submission cream01.lcg.cscs.ch
The Creams use the following check to determine whether to publish their state as draining or production. This file is managed by cfengine.
/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
Arc
To disable new submissions in ARC, the 'allownew' option in arc.conf needs to be changed. This file is managed by cfengine, so make the following edit for each ARC host.
vim /srv/cfengine/files/arc01/etc/arc.conf
#allownew=yes
allownew=no
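The same edit can be applied without opening an editor. This is a minimal sketch, assuming the option appears on its own line exactly as "allownew=yes". The function name set_allownew_no is hypothetical, and the arc* glob in the usage comment is an assumption, since only arc01 is named on this page.

```shell
#!/bin/sh
# Hypothetical helper: comment out "allownew=yes" and add "allownew=no"
# directly below it, keeping a .bak copy of the original file.
# Lines that do not match exactly "allownew=yes" are left untouched.
set_allownew_no() {
    conf=$1
    sed -i.bak 's/^allownew=yes$/#allownew=yes\
allownew=no/' "$conf"
}

# Usage (the arc* glob is an assumption; adjust to the real host list):
#   for f in /srv/cfengine/files/arc*/etc/arc.conf; do set_allownew_no "$f"; done
```

Keeping the old value as a comment mirrors the manual edit above and makes it easy to revert once the downtime is over.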
Slurm
Even with submission disabled, jobs may still find their way into the cluster. To prevent this, we can set the partitions to a draining state within Slurm.
To change a partition to drain, do the following for each VO:
scontrol update partitionname=lcgadmin state=drain
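Repeating the command for every VO partition can be wrapped in a loop. This is a minimal sketch: the function name drain_partitions and the SCONTROL override variable (useful for a dry run) are hypothetical, and every partition name other than lcgadmin in the usage comment is a placeholder.

```shell
#!/bin/sh
# SCONTROL can be overridden (e.g. SCONTROL=echo) for a dry run.
SCONTROL=${SCONTROL:-scontrol}

# Hypothetical helper: set every partition given as an argument to drain.
# Stops at the first failure so a typo in a name does not go unnoticed.
drain_partitions() {
    for part in "$@"; do
        $SCONTROL update partitionname="$part" state=drain || return 1
        echo "partition $part set to drain"
    done
}

# Usage (names other than lcgadmin are placeholders):
#   drain_partitions lcgadmin vo_partition_a vo_partition_b
```

Afterwards, the result can be checked with "scontrol show partition lcgadmin", which should report the partition state.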
Monitoring and Logging
If you use the "at" command to schedule an action, such as changing the partition state, please use its mail functionality (-m flag) and write to a log to preserve historical data.
Moreover, please post on the Change Management Tool running on neggio in order to maintain an official log and a reference for other sysadmins.
For example, to set a Slurm partition to drain:
echo 'scontrol update partitionname=lcgadmin state=drain && echo "Set slurm partition lcgadmin to drain" | logger -t AT' | at -m 7 AM + 5 days
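Because "at" reads the job to run from standard input, the job text can be generated per partition and piped in. This is a minimal sketch: the function name drain_job is hypothetical, and it only prints the command line, so the output can be reviewed before it is handed to "at".

```shell
#!/bin/sh
# Hypothetical helper: print the job text that drains one partition and
# records the change in syslog under the tag AT.
drain_job() {
    printf 'scontrol update partitionname=%s state=drain && echo "Set slurm partition %s to drain" | logger -t AT\n' "$1" "$1"
}

# Usage: review the text first, then schedule it with mail notification.
#   drain_job lcgadmin
#   drain_job lcgadmin | at -m 7 AM + 5 days
```

Pending jobs can be listed afterwards with "atq".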
TODO:
- Nagios checks for partition state - DONE GB 20/11/2013
- Nagios checks Cream submission state - Need to confirm LDAP output
- Nagios checks ARC submission state - DONE GB 20/11/2013
--
GeorgeBrown - 2013-11-20