Shutting down the Tier-3
Before Downtime
- make the downtime announcement in:
- check the list of nodes on t3admin02 in the node-list-t3 directory
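For example (assuming node-list-t3 sits in root's home directory on t3admin02, as in the pssh commands further down):
ssh root@t3admin02 ls node-list-t3/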
- stop the snapshot script for /work (on t3nfs02, in /etc/cron.daily/zfssnap comment out the line
# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1
)
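A possible one-liner to comment it out (a sketch; verify the exact line in the cron script first):
ssh t3nfs02 "sed -i 's|^/opt/zfssnap/zfssnap|# &|' /etc/cron.daily/zfssnap"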
- when possible, prepare firmware updates for the NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum + kernel updates
Downtime day:
- Prevent further logins to the user interfaces. Modify
/etc/security/access_users.conf
on the user interfaces by commenting out the lines that allow access for all CMS users
#+ : cms : ALL
- : ALL : ALL
- stop Icinga notifications: on the emonma00 node, in
/opt/icinga-config/tier3/objects/tier3_templates.cfg
comment out the line with the members, like
define contactgroup{
contactgroup_name t3-admins
alias TIER3 Administrators
# members ......................
- disable all user queues/ all Slurm Partitions:
ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN;scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN;scontrol update PartitionName=quick State=DRAIN "
- Delete any remaining jobs in the queue system
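For Slurm this could be done, for example, by cancelling whatever is still listed by squeue (a sketch, run as root):
ssh t3slurm 'squeue -h -o %i | xargs -r scancel'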
- Unmount PNFS on the nodes
- umount /pnfs on all nodes: UIs, WNs (from t3admin02):
pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"
- comment out the /pnfs line in fstab to prevent mounting after reboots
for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
- stop puppet runs on the Slurm clients (optional); see the sketch below
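One possible way from t3admin02 (a sketch, assuming the agent runs as the systemd service puppet):
pssh -h node-list-t3/slurm-clients -P "systemctl stop puppet"
or disable the agent with a note instead:
pssh -h node-list-t3/slurm-clients -P "puppet agent --disable 'T3 downtime'"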
- if there is a shutdown of t3nfs02, then umount /work and on the clients: "sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab"
- and correspondingly for big maintenance days: umount /t3home and "sed -i 's/\(t3nfs.*\)/# Downtime \1/' /etc/fstab"
- Shut down the worker nodes
for n in $(seq 10 59) ; do echo t3wn$n; ssh root@t3wn$n shutdown -h now ; sleep 1 ; done
- Check whether all nodes are down
for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status ; done
- Stop PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Notice that PhEDEx runs as the phedex user and not as root.
ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
- stop the xrootd and cmsd services on t3se01, like
ssh t3se01 systemctl stop xrootd@clustered
ssh t3se01 systemctl stop cmsd@clustered
- dcache stop steps:
- stop the doors on t3se01 - dcap/gsidcap, gsiftp, srm, xrootd - all visible from "dcache status", like
ssh t3se01 dcache stop dcap-t3se01Domain
and stop xrootd door on t3dcachedb03
- stop dCache on the pools t3fs07-11:
[root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"
- Unmount PNFS from the SE and DB servers:
ssh t3se01 umount /pnfs
ssh t3dcachedb03 umount /pnfs
- stop the remaining dCache services on t3se01:
ssh t3se01 dcache stop
- Stop dCache services on the DB server:
ssh t3dcachedb03 dcache stop
- Stop PostgreSQL on the DB server (check the status first, then stop):
ssh t3dcachedb03 systemctl stop postgresql-11
- stop zookeeper on t3zkpr11-13:
for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done
- Stop the BDII:
ssh root@t3bdii "/etc/init.d/bdii stop"
- frontier - VMs ?: (Derek: I left all VMs running. They will be shut down by the VM team)
- shut down t3nfs02, t3gpu01-02, t3admin02 and t3fs07-11; on t3fs07-10 switch off the server first and the JBOD afterwards
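For the servers this can be scripted over ssh, e.g. (a sketch; shut down t3admin02 separately at the end, since it hosts the node lists used by pssh):
for h in t3nfs02 t3gpu01 t3gpu02 t3fs07 t3fs08 t3fs09 t3fs10 t3fs11; do ssh root@$h shutdown -h now; sleep 1; done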
- Shut down the NetApp system (Link)
- Make sure no background processes are in operation (SANtricity SMclient GUI)
- Turn off controller enclosure
- (Turn off any additional enclosure)
Start Tier-3
- power on hardware
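For the worker nodes this can be done via IPMI from t3admin02, mirroring the power-status check above (a sketch):
for n in $(seq 10 59); do echo t3wn$n; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power on; done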
- on the ZooKeeper VM nodes t3zkpr11-13 check
systemctl status zookeeper
and on t3zkpr11 run
zkcli -server t3zkpr11
- t3dcachedb03: check/start PostgreSQL
systemctl start postgresql-11
and check
systemctl status crond
and
dcache check-config
then start all main dCache services (dcache start *Domain)
except the doors (currently only one xrootd door is configured on t3dcachedb03)
- t3se01: start the services other than the doors, as listed by
dcache status
(like dcache start info-t3se01Domain
; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should likewise be started); the doors (at the moment also configured on t3se01) should be started after the pools
- mount PNFS on t3se01 and t3dcachedb03:
mount /pnfs
- start dcache on pools t3fs01-11
[root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start"
takes about 15-30 min (in case of a hardware issue, switch on the JBOD first and then the server)
- check dcache logs in /var/log/dcache on pools, dcachedb and se machines
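For a quick look at recent errors on the pools one could, for example, run from t3admin02 (a sketch):
pssh -h node-list-t3/dcache-pools -P "grep -i error /var/log/dcache/*.log | tail -n 20"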
- check if the NetApp is visible from t3fs11:
[root@t3fs11 ~]# multipath -ll
- t3se01: start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see
dcache status
), like
dcache start dcap-t3se01Domain
, etc., and the xrootd door on t3dcachedb03
- t3se01: check (and start) the xrootd redirector
systemctl start cmsd@clustered
systemctl start xrootd@clustered
- check on all UIs and WNs/CNs if /pnfs/psi.ch/cms is mounted like
pssh -h node-list-t3/slurm-clients -P "mount |grep pnfs"
- Slurm: on t3slurm
scontrol update PartitionName=gpu State=UP
and scontrol update PartitionName=wn State=UP
, etc. for all Partitions
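The same can be written as a loop over the four partitions listed in the shutdown step (a sketch):
ssh t3slurm 'for p in gpu wn qgpu quick; do scontrol update PartitionName=$p State=UP; done'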
- When the whole T3 is up, one can perform the following useful checks: