Shutting down the Tier-3

Before Downtime

Announce the downtime in:
- t3 user list: cms-tier3-users@lists.psi.ch
- t3 admin wiki
- GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (the downtime must be SCHEDULED, i.e. its start must be at least 24 hours in the future)
Check the list of nodes on t3admin02 in the node-list-t3 directories.
Stop the snapshot script for /work: on t3nfs02, in /etc/cron.daily/zfssnap, comment out the zfssnap call so that it reads
# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1
When possible, prepare firmware updates for the NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum and kernel updates.

Downtime day:

Prevent further logins to the user interfaces. Modify /etc/security/access_users.conf on the user interfaces by commenting out the lines that allow access for all CMS users:
```
#+ : cms : ALL
- : ALL : ALL
```
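
A quick way to confirm the edit on each user interface is to test whether an active allow rule for `cms` is still present. This is only a sketch: the function name `cms_logins_allowed` is hypothetical, and it assumes the rule is written as `+ : cms : ALL`, possibly with different whitespace around the colons.

```shell
# Succeeds (exit 0) if FILE still contains an uncommented "+ : cms : ..." rule.
# Assumption: whitespace around the colons may vary; disabled rules start with "#".
cms_logins_allowed() {
    grep -Eq '^[[:space:]]*\+[[:space:]]*:[[:space:]]*cms[[:space:]]*:' "$1"
}

# Example, run on a user interface:
#   if cms_logins_allowed /etc/security/access_users.conf; then
#       echo "CMS logins are still allowed on $(hostname)"
#   fi
```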

Stop Icinga notifications: on the emonma00 node, in /opt/icinga-config/tier3/objects/tier3_templates.cfg, comment out the members line of the t3-admins contact group, like this:

define contactgroup{
        contactgroup_name       t3-admins
        alias                   TIER3 Administrators
   #    members          ......................

Disable all user queues (all Slurm partitions):

  ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN; scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN; scontrol update PartitionName=quick State=DRAIN"
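
The one-liner above can also be generated from a list of partitions, which makes it easier to review before running and to reuse with State=UP when the cluster comes back. A minimal sketch: `partition_cmds` is a hypothetical helper name, and the partition names are the ones from the command above (the live list should be checked with `sinfo` first).

```shell
# Print one "scontrol update" command per partition; nothing is executed.
# Usage: partition_cmds STATE PARTITION...
partition_cmds() {
    state=$1
    shift
    for p in "$@"; do
        printf 'scontrol update PartitionName=%s State=%s\n' "$p" "$state"
    done
}

# Partition names taken from the command above.
partition_cmds DRAIN gpu wn qgpu quick

# After reviewing the output, the commands can be piped to the Slurm host:
#   partition_cmds DRAIN gpu wn qgpu quick | ssh t3slurm sh
```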

Delete any remaining jobs in the queue system.
Unmount PNFS on the nodes:
1. Unmount /pnfs on all nodes (UIs and WNs), from t3admin02: pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"
2. Comment out the /pnfs line in /etc/fstab to prevent a mount after reboots:
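```
      # ^ anchors the match at the start of the line; \( \) captures the entry so that \1 can re-emit it
      for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's|^\(t3dcachedb03:/pnfs/psi.ch/cms.*\)$|# COMMENTED FOR DOWNTIME \1|' /etc/fstab"; done
```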
3. Stop the puppet runs on the Slurm clients (optional).
If t3nfs02 is shut down as well, also unmount /work and, on the clients, run: "sed -i 's/^\(t3nfs02.*\)$/# Downtime \1/' /etc/fstab"
Correspondingly, for big maintenance days, unmount /t3home and run: "sed -i 's/^\(t3nfs.*\)$/# Downtime \1/' /etc/fstab"
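
Because a wrong edit to /etc/fstab only shows up at the next boot, it can help to preview the change before applying it with sed -i. The sketch below prints the result instead of modifying the file. The function names `comment_mounts` and `restore_mounts` are hypothetical, and the code assumes that fstab entries start in the first column with the server name (e.g. `t3nfs02:/work ...`) and that the `# Downtime ` marker is not used for anything else.

```shell
# Print FILE with every entry that starts with SERVER commented out.
# Usage: comment_mounts SERVER FILE   (nothing is modified)
comment_mounts() {
    sed "s|^\\($1.*\\)\$|# Downtime \\1|" "$2"
}

# Print FILE with the "# Downtime " marker removed again (for the restart).
restore_mounts() {
    sed 's|^# Downtime ||' "$1"
}

# Preview on a client before editing in place:
#   comment_mounts t3nfs02 /etc/fstab | diff /etc/fstab -
```

Passing `t3nfs` instead of `t3nfs02` covers all t3nfs servers, which corresponds to the big-maintenance case above.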

Shut down the worker nodes

Shut down the nodes

      for n in $(seq 10 59) ; do echo t3wn$n; ssh root@t3wn$n shutdown -h now ; sleep 1 ; done

Check whether all nodes are down

for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status ; done
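
With about fifty nodes, it is easy to miss one that is still powered on. The following sketch filters the output of the loop above down to the nodes that are not reported as off. `nodes_not_off` is a hypothetical name, and the filter assumes that each node produces exactly one line of the form `t3wn10: Chassis Power is off` (adding `2>&1` to the ipmitool call keeps error messages on the node's own line).

```shell
# Read "node: Chassis Power is on|off" lines on stdin and print the names of
# the nodes that are NOT reported as off (still on, or IPMI query failed).
nodes_not_off() {
    awk -F: '!/Chassis Power is off/ { print $1 }'
}

# Example with sample input:
printf '%s\n' 't3wn10: Chassis Power is off' 't3wn11: Chassis Power is on' | nodes_not_off
```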

Stop PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Note that PhEDEx runs as the phedex user and not as root.
```
      ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
```
Stop the xrootd and cmsd services on t3se01, e.g. ssh t3se01 systemctl stop xrootd@clustered (and likewise for cmsd@clustered).
dCache stop steps:
1. Stop the doors on t3se01 (dcap, gsidcap, gsiftp, srm, xrootd; all are listed by "dcache status"), for example:
```
 ssh t3se01 dcache stop dcap-t3se01Domain
```
  and stop the xrootd door on t3dcachedb03.
2. Stop the pools on t3fs07-11: [root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"
3. Unmount PNFS from the SE and DB servers: ssh t3se01 umount /pnfs; ssh t3dcachedb03 umount /pnfs
4. Stop dCache on t3se01: ssh t3se01 dcache stop
5. Stop the dCache services on the DB server: ssh t3dcachedb03 dcache stop
6. Stop PostgreSQL on the DB server: ssh t3dcachedb03 systemctl stop postgresql-11 (verify with systemctl status postgresql-11)
7. Stop ZooKeeper on t3zkpr11-13: for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done
Stop the BDII: ssh root@t3bdii "/etc/init.d/bdii stop"
Frontier and other VMs: (Derek: I left all VMs running. They will be shut down by the VM team)
Shut down t3nfs02, t3gpu01-2, t3admin02 and t3fs07-11; on t3fs07-10, switch off the server first and the JBOD afterwards.
Shut down the NetApp system:
- Make sure that no background processes are running (check in the SANtricity SMclient GUI)
- Turn off the controller enclosure
- (Turn off any additional enclosures)

Start Tier-3

Power on the hardware.
On the ZooKeeper VMs t3zkpr11-13, check systemctl status zookeeper, and on t3zkpr11 run zkcli -server t3zkpr11.
On t3dcachedb03, start PostgreSQL (systemctl start postgresql-11), check systemctl status crond, and run dcache check-config. Then start all main dCache services (dcache start *Domain) except the doors (currently only one xrootd door is configured on t3dcachedb03).
On t3se01, start the services other than the doors, as listed by dcache status (e.g. dcache start info-t3se01Domain; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should be started the same way). The doors (at the moment also configured on t3se01) should be started after the pools.
Mount /pnfs on t3se01 and t3dcachedb03.
Start dCache on the pools t3fs01-11: [root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start". This takes about 15-30 minutes. (In case of a hardware issue, switch on the JBOD first and then the server.)
Check the dCache logs in /var/log/dcache on the pool, dcachedb and SE machines.
Check whether the NetApp is visible from t3fs11: [root@t3nfs11 ~]# multipath -ll
On t3se01, start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (see dcache status for the list), e.g. dcache start dcap-t3se01Domain; also start the xrootd door on t3dcachedb03.
On t3se01, check (and start) the xrootd redirector: systemctl start cmsd@clustered ; systemctl start xrootd@clustered
Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, e.g. pssh -h node-list-t3/slurm-clients -P "mount |grep pnfs"
Slurm: on t3slurm, run scontrol update PartitionName=gpu State=UP, scontrol update PartitionName=wn State=UP, and so on for all partitions.
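
Several of the start-up steps involve waiting: hosts have to come back after power-on, and the doors should only be started after the pools. A small polling helper can make such waits explicit. This is a generic sketch, not part of the original procedure: `wait_until` is a hypothetical name, and the check command to pass to it (here, SSH reachability of a pool server) is an assumption that should be replaced by whatever check fits the step.

```shell
# Re-run a check command until it succeeds or the time budget is used up.
# Usage: wait_until LIMIT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if LIMIT_SECONDS is reached first.
wait_until() {
    limit=$1
    interval=$2
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$limit" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Example: wait up to 30 minutes, polling once a minute, until t3fs07 accepts SSH.
#   wait_until 1800 60 ssh -o ConnectTimeout=5 t3fs07 true || echo "t3fs07 not reachable"
```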
Once the whole T3 is up, run the following checks:
- Run test-dCacheProtocols from a UI
- Monitoring List
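
The /pnfs mount check from the start-up steps can also be done locally on a node with an exact match on the mount point instead of a grep for "pnfs". A minimal sketch: `is_mounted` is a hypothetical helper, and the optional second argument exists only so that the function can be tried against a saved copy of a mounts table.

```shell
# Succeed (exit 0) if MOUNTPOINT appears as a mount target in the mounts table.
# Usage: is_mounted MOUNTPOINT [MOUNTS_FILE]   (MOUNTS_FILE defaults to /proc/mounts)
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Example, run on a UI or worker node:
#   is_mounted /pnfs/psi.ch/cms || echo "/pnfs/psi.ch/cms is not mounted on $(hostname)"
```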

Topic revision: r20 - 2020-06-03 - NinaLoktionova
