Site Downtime Procedure

This page describes the actions required to place the site into downtime.

Change management

Make a note on http://neggio.cscs.ch/forum/ about the change.

Announcement

An official announcement must be made at least 5 days before the site goes down.

CreamCE

To check the current state of a CREAM CE you need a valid certificate. Then run the following:

glite-ce-service-info cream01.lcg.cscs.ch

Interface Version  = [2.1]
Service Version    = [1.16.2 - EMI version: 3.6.0-1.el6]
Description        = [CREAM 2]
Started at         = [Fri Nov  8 17:53:18 2013]
Submission enabled = [NO]
Status             = [RUNNING]
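
When checking several CEs it can help to pull out just the submission flag. The following is a minimal sketch, assuming the output format shown above; the function name submission_enabled is hypothetical and not part of the gLite tools.

```shell
#!/bin/sh
# Print the value of the "Submission enabled" field (YES or NO) from
# glite-ce-service-info output read on stdin.
# submission_enabled is a hypothetical helper name, not a gLite command.
submission_enabled() {
    sed -n 's/^Submission enabled *= *\[\(.*\)].*$/\1/p'
}

# Usage against a live CE (needs a valid certificate):
#   glite-ce-service-info cream01.lcg.cscs.ch | submission_enabled
```

If the field is missing from the output, the function prints nothing, which a calling script should treat as an unknown state.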

To disable submission, run the following:

glite-ce-disable-submission cream01.lcg.cscs.ch

The Creams use the following check to determine whether they should publish that they are draining or in production. This file is managed by cfengine:

/var/lib/bdii/gip/plugin/glite-info-dynamic-ce
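
To see what a CE is actually publishing, the state can be read back over LDAP. This is only a sketch: the attribute name GlueCEStateStatus, port 2170 and the base DN are assumptions based on the usual GLUE 1 resource BDII layout and have not been confirmed against this site's LDAP output. The helper name ce_state_status is hypothetical.

```shell
#!/bin/sh
# Print every GlueCEStateStatus value found in LDIF read on stdin.
# ce_state_status is a hypothetical helper; the attribute name is an assumption.
ce_state_status() {
    sed -n 's/^GlueCEStateStatus: *//p'
}

# Assumed query against the CE's resource BDII (port and base DN are assumptions):
#   ldapsearch -x -LLL -H ldap://cream01.lcg.cscs.ch:2170 \
#       -b mds-vo-name=resource,o=grid '(objectClass=GlueCE)' GlueCEStateStatus \
#       | ce_state_status
```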

Arc

To disable submission to ARC, the 'allownew' option in arc.conf needs to be changed. This file is managed by cfengine, so make the following edit for each ARC host.

vim /srv/cfengine/files/arc01/etc/arc.conf

  #allownew=yes
  allownew=no
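
With several ARC hosts the same edit can be scripted. The sketch below rewrites any active allownew line to allownew=no in each file it is given; it leaves commented lines alone and does not keep the old value as a comment the way the manual edit above does. The function name disable_allownew is hypothetical, and any path other than the arc01 one above would be an assumption.

```shell
#!/bin/sh
# Set allownew=no in every arc.conf passed as an argument.
# disable_allownew is a hypothetical helper name.
disable_allownew() {
    for conf in "$@"; do
        tmp=$(mktemp) || return 1
        # Rewrite only an active (uncommented) allownew line.
        # Writing back with cat keeps the original file's permissions.
        sed 's/^\([[:space:]]*\)allownew=.*/\1allownew=no/' "$conf" > "$tmp" &&
            cat "$tmp" > "$conf"
        rm -f "$tmp"
    done
}

# Usage:
#   disable_allownew /srv/cfengine/files/arc01/etc/arc.conf
```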

Slurm

Even with submission disabled, jobs may still find their way into the cluster, so we can also set the partitions to a draining state within Slurm.

To change a partition to drain, do the following for each VO:

scontrol update partitionname=lcgadmin state=drain
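
Repeating the command per VO can be wrapped in a small loop. This is a sketch only: drain_partitions and the SCONTROL variable are hypothetical names, and any partition name other than lcgadmin is a placeholder for the site's real VO partitions.

```shell
#!/bin/sh
# Command used to talk to Slurm; set SCONTROL=echo for a dry run.
SCONTROL=${SCONTROL:-scontrol}

# Set each named partition to drain, stopping at the first failure.
# drain_partitions is a hypothetical helper name.
drain_partitions() {
    for part in "$@"; do
        $SCONTROL update partitionname="$part" state=drain || return 1
    done
}

# Usage (atlas and cms are placeholder partition names):
#   drain_partitions lcgadmin atlas cms
```

Running it first with SCONTROL=echo prints the commands without changing any partition state.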

Monitoring and Logging

If you use the "at" command to schedule an action, such as changing the partition state, please use its mail functionality (-m flag) and write to a log to preserve historical data. Please also post on the Change Management Tool running on neggio, to keep an official log and a reference for other sysadmins.

For example, to set a Slurm partition to drain:

echo 'scontrol update partitionname=lcgadmin state=drain && echo "Set slurm partition lcgadmin to drain" | logger -t AT' | at -m 7 AM + 5 days
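
Note that at reads the job to run from standard input (or from a file given with -f), not from its command line. When several partitions need scheduling, the job text can be generated per partition and piped in. A minimal sketch, where drain_job is a hypothetical helper name:

```shell
#!/bin/sh
# Print the job text that drains one partition and logs the change via syslog.
# drain_job is a hypothetical helper name.
drain_job() {
    part=$1
    printf 'scontrol update partitionname=%s state=drain && echo "Set slurm partition %s to drain" | logger -t AT\n' "$part" "$part"
}

# Usage: queue the job, with -m so at mails the result when it has run.
#   drain_job lcgadmin | at -m 7 AM + 5 days
#   atq    # list queued jobs to confirm it was accepted
```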

TODO:

  • Nagios checks for partition state - DONE GB 20/11/2013
  • Nagios checks Cream submission state - Need to confirm LDAP output
  • Nagios checks ARC submission state - DONE GB 20/11/2013

-- GeorgeBrown - 2013-11-20


Topic revision: r5 - 2013-11-20 - GeorgeBrown
 