Announcing a Tier-3 shutdown

Announce the system halt on the GOC Pages (24h before if it is a scheduled downtime)
Announce the system halt on cms-tier3-users@lists.psi.ch

Shutting down the Tier-3

Temporary note: This sequence is based on a mail by Nina. I (Derek) followed the sequence, adapted it in some places, and added explicit commands where I could. TODO (Nina):

Check the sequence and provide explicit (and, as far as possible, homogeneous) commands.
The commands must be runnable from t3admin01, not from a laptop, i.e. the configuration files that define the names of worker nodes, service nodes, etc. and that are used to send parallel commands must be available on t3admin01. I do not care which parallel mechanism is used (cexec, pssh, etc.), but the configuration and the commands to run must be given explicitly in this list, and the configuration must be local to the admin node.
The ssh host keys in .ssh/known_hosts on t3admin01 were completely out of date, and apparently many of the WN keys have changed. This prevents working from t3admin01.
/etc/hosts contains a number of obsolete entries (the MeG nodes are still listed). We also need to decide whether the public addresses are kept in that file or only the private ones. At the moment the public addresses are incomplete.

Sequence:

Before Downtime

Announce the downtime in:
- t3 user list: cms-tier3-users@lists.psi.ch
- t3 admin wiki
- GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (to count as SCHEDULED, the start must be at least 24 hours in the future)
Stop the backup script for /shome: on t3nfs01, comment out this line in /etc/cron.daily/zfssnap so that it reads
# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1
When possible, prepare firmware (FW) updates on the NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), and prepare for yum and kernel updates.
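
The cron edit above can be scripted so that it is easy to undo after the downtime. This is a minimal sketch, assuming GNU sed and that the active line starts with /opt/zfssnap/zfssnap (optionally indented). The function names disable_zfssnap and enable_zfssnap are hypothetical. No backup copy is written next to the script, because a stray copy inside /etc/cron.daily might itself be executed by cron.

```shell
# Hypothetical helpers (not part of the original procedure).
# Both take the path of the cron script as their only argument.

# Comment out the zfssnap call. Safe to run twice: an already
# commented line no longer matches the pattern.
disable_zfssnap() {
    sed -i 's|^\([[:space:]]*\)\(/opt/zfssnap/zfssnap .*\)$|\1# \2|' "$1"
}

# Reverse operation for the end of the downtime.
enable_zfssnap() {
    sed -i 's|^\([[:space:]]*\)#[[:space:]]*\(/opt/zfssnap/zfssnap .*\)$|\1\2|' "$1"
}
```

On t3nfs01 this would be called as disable_zfssnap /etc/cron.daily/zfssnap before the downtime and enable_zfssnap /etc/cron.daily/zfssnap afterwards.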

Downtime day:

Prevent further logins to the user interfaces: in /etc/security/access.conf on each user interface, comment out the lines that allow access for all CMS users and for NX, so that only the admin accounts remain:
```
#+ : cms : :0 129.129.0.0/16
+ : feichtinger loktionova_n : : ALL
#+ : cms : ALL
#+ : nx : ALL
- : ALL : ALL
```
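
The same edit can be applied with sed instead of by hand. This is a sketch, assuming GNU sed and that the rules are written exactly in the "+ : cms : ..." and "+ : nx : ..." form shown above. The function name close_ui_logins is hypothetical.

```shell
# Hypothetical helper: comment out every "+ : cms : ..." and
# "+ : nx : ..." rule in the access.conf file given as argument.
# Rules for other accounts and the final "- : ALL : ALL" are untouched.
close_ui_logins() {
    sed -i -E 's/^(\+[[:space:]]*:[[:space:]]*(cms|nx)[[:space:]]*:.*)$/#\1/' "$1"
}
```

It is worth trying it on a copy first, e.g. cp /etc/security/access.conf /tmp/access.conf.test followed by close_ui_logins /tmp/access.conf.test, and comparing the result with diff.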

Stop Nagios

ssh root@t3nagios /etc/init.d/nagios stop

Disable all user queues on the WNs:
```
  ssh t3ce02 "qmod -d '*@*'"
```
Then delete any jobs that remain in the queue system.
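
The page gives no explicit command for deleting the remaining jobs. The sketch below fills that gap under assumptions: it uses the Grid Engine commands qstat -u '*' and qdel -u '*', which act on the jobs of all users only when run with manager rights on t3ce02. The helper names run and drain_batch and the DRY_RUN switch are hypothetical. By default the commands are only printed.

```shell
# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Disable the queues, list what is still queued or running,
# then delete the jobs of all users.
# The '*' patterns are quoted so that no shell expands them.
drain_batch() {
    run ssh t3ce02 "qmod -d '*@*'"
    run ssh t3ce02 "qstat -u '*'"
    run ssh t3ce02 "qdel -u '*'"
}
```

Calling drain_batch shows the three commands. Setting DRY_RUN=0 beforehand executes them.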

Unmount PNFS on the nodes

Unmount /pnfs on all nodes: UIs, WNs, t3se01 and t3dcachedb03

       cexec ui: umount /pnfs/psi.ch/cms
       cexec wn: umount /pnfs/psi.ch/cms

Comment out the /pnfs line in /etc/fstab to prevent it from being mounted after a reboot

      for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
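
The sed expression above is not anchored, so running it a second time would stack a second marker on the same line, and no reverse command is given for the end of the downtime. The sketch below shows an anchored pair, assuming GNU sed and that the fstab entry starts in the first column. The function names comment_pnfs and uncomment_pnfs are hypothetical.

```shell
# Hypothetical helpers; the argument is the fstab file to edit.

# Prefix the dCache NFS entry with the downtime marker.
# Anchored at line start, so a second run changes nothing.
comment_pnfs() {
    sed -i 's|^\(t3dcachedb03:/pnfs/psi\.ch/cms.*\)$|# COMMENTED FOR DOWNTIME \1|' "$1"
}

# Remove the marker again once dCache is back.
uncomment_pnfs() {
    sed -i 's|^# COMMENTED FOR DOWNTIME \(t3dcachedb03:/pnfs/psi\.ch/cms.*\)$|\1|' "$1"
}
```

One possible way to run a function on a remote node, assuming bash on both sides, is ssh root@t3wn10 "$(declare -f comment_pnfs); comment_pnfs /etc/fstab".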

Shut down the worker nodes

Shut down the nodes

      for n in $(seq 10 59) ; do echo t3wn$n; ssh root@t3wn$n shutdown -h now ; sleep 1 ; done

Check whether all nodes are down

for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/ipmi-pw chassis power status ; done
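
The loops above spell out the node range each time, and only the power check skips 21, 24 and 48. The sketch below keeps the list in one place and prints only the nodes that are not yet off. It assumes that ipmitool reports the state as "Chassis Power is off". The function names wn_numbers and wn_still_on are hypothetical.

```shell
# Hypothetical helper: worker node numbers as used in the power
# check above (10-59 without 21, 24 and 48), one per line.
wn_numbers() {
    seq 10 20
    echo 22
    echo 23
    seq 25 47
    seq 49 59
}

# Query the IPMI power state of every worker node and print only
# the nodes that do not report "off" (including IPMI errors).
# Returns 0 when all nodes are off, 1 otherwise.
wn_still_on() {
    local n status rc=0
    for n in $(wn_numbers); do
        status=$(ipmitool -I lanplus -H "rmwn$n" -U root \
                 -f /root/private/ipmi-pw chassis power status 2>&1)
        case "$status" in
            *"Chassis Power is off"*) ;;
            *) echo "t3wn$n: $status"; rc=1 ;;
        esac
    done
    return $rc
}
```

A waiting loop could then be written as: until wn_still_on; do sleep 30; done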

Stop PhEDEx on t3cmsvobox, since it relies on dCache transfers. Note that PhEDEx runs as the phedex user, not as root.
```
      ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
```

Stop the xrootd and cmsd services on t3se01

       ssh t3se01 service xrootd stop
       ssh t3se01 service cmsd stop

Stop dCache

Unmount PNFS from the SE and DB servers

      ssh t3se01 umount /pnfs
      ssh t3dcachedb03 umount /pnfs

Stop dCache on the pools (t3fs07-11, t3nfs02 4.4):

  cexec fs: service dcache-server stop

Stop the doors (dcap/gsidcap, gsiftp, srm, xrootd) and the other dCache services on t3se01:

  ssh t3se01 service dcache-server stop

Stop dcache services on the DB server

  ssh t3dcachedb03 service dcache-server stop

Stop PostgreSQL on the DB server

  ssh t3dcachedb03 /etc/init.d/postgresql-9.5 stop

Stop ZooKeeper on t3zkpr01-03

        for n in $(seq 1 3); do ssh t3zkpr0$n systemctl stop zookeeper.service; done

Stop the BDII

   ssh root@t3bdii "/etc/init.d/bdii stop"

Frontier and other VMs: (Derek: I left all VMs running. They will be shut down by the VM team.)
Shut down t3nfs01, t3nfs02, t3fs07-11, t3gpu01-2, t3admin01/02
Shut down the NetApp system:
- Make sure no background operations are in progress (check in the SANtricity SMclient GUI)
- Turn off the controller enclosure
- (Turn off any additional enclosure)

Start Tier-3

On the ZooKeeper VMs t3zkpr01-03, check the zookeeper service status and test a client connection with zkcli -server t3zkpr01.
On t3dcachedb03:
- check PostgreSQL: /etc/init.d/postgresql-9.5 status (start it if needed)
- check cron: service crond status (start it if needed)
- run dcache check-config
- start all main dCache services: dcache start
- afterwards, mount /pnfs
On t3se01, mount /pnfs and start all services listed by dcache status except the doors.
Start dCache on the pools t3fs01-11 (switch on the JBOD first, then the server).
Check the dCache logs in /var/log/dcache on the pool, dcachedb and SE machines.
On t3se01, start the doors if not done yet: dcap, gsidcap, gsiftp, srm, xrootd (see dcache status for the list).
On t3se01, check (and start if needed) the xrootd redirector services: cmsd/xrootd status.
Check on all UIs and WNs that /pnfs/psi.ch/cms is mounted: pssh -h node-list-t3/slurm-clients -P "mount |grep pnfs"
Run test-dCacheProtocols from a UI.
Once the whole T3 is up, the following checks are useful:
- http://t3mon.psi.ch/ and https://icinga.psi.ch/
- https://t3nagios.psi.ch/check_mk/index.py?start_url=%2Fcheck_mk%2Fview.py%3Fview_name%3Dhosts%26host%3Dt3bdii02
- https://t3nagios.psi.ch/check_mk/index.py?start_url=%2Fcheck_mk%2Fview.py%3Fview_name%3Dhost%26host%3Dt3cmsvobox01%26site%3D
- https://etf-cms-prod.cern.ch/etf/check_mk/view.py?view_name=service&service=org.cms.SRM-VOPut-/cms/Role=production&host=t3se01.psi.ch
- phedex: https://cmsweb.cern.ch/phedex/prod/Components::Status
- batch: on t3ce02, check and enable the WNs with qmod -e '*@*' and check the SGE status: /etc/init.d/sgedbwriter.p6444 status
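
The /pnfs mount check in the start-up sequence prints the matching mount line of every node, so a node where the mount is missing is easy to overlook. The sketch below reports only the nodes where /pnfs/psi.ch/cms is absent. It assumes a host list file with one host name per line, as used by pssh -h, and password-less ssh from the admin node. The function names has_pnfs_mount and check_pnfs_nodes are hypothetical.

```shell
# Hypothetical helper: reads a mount table in /proc/mounts format
# on stdin; succeeds if something is mounted on /pnfs/psi.ch/cms.
has_pnfs_mount() {
    awk '$2 == "/pnfs/psi.ch/cms" { found = 1 } END { exit !found }'
}

# Check every host listed in the file given as argument.
# Prints the hosts without the mount (or that cannot be reached)
# and returns 1 if there is at least one.
check_pnfs_nodes() {
    local node rc=0
    while read -r node; do
        [ -n "$node" ] || continue
        # -n keeps ssh from swallowing the rest of the host list
        if ! ssh -n "$node" cat /proc/mounts | has_pnfs_mount; then
            echo "$node: /pnfs/psi.ch/cms not mounted"
            rc=1
        fi
    done < "$1"
    return $rc
}
```

With the host list mentioned above, the call would be check_pnfs_nodes node-list-t3/slurm-clients.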