<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
# * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
# * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
# * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->

---+ Shutting down the Tier-3

*Before* the downtime:
   1. Make the announcement in:
      * the T3 user list: cms-tier3-users@lists.psi.ch
      * the T3 admin wiki
      * GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (to be SCHEDULED; the start must be at least 24 hours in the future)
   1. Check the list of nodes on *t3admin02* in the *node-list-t3* directories.
   1. Stop the snapshot script of /work: on t3nfs02 comment out the following line in =/etc/cron.daily/zfssnap=: <br /> =# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1=
   1. When possible, prepare FW updates for the !NetApp and the HPs (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum + kernel updates.

Downtime day:
   1. Prevent further logins to the user interfaces: modify =/etc/security/access_users.conf= on the user interfaces by commenting out the line that allows access for all CMS users: <pre>
#+ : cms : ALL
- : ALL : ALL
</pre>
   1. Stop Icinga notifications: on the *emonma00* node, in =/opt/icinga-config/tier3/objects/tier3_templates.cfg=, comment out the line with the members, like: <pre>
define contactgroup{
        contactgroup_name       t3-admins
        alias                   TIER3 Administrators
#       members                 ......................
</pre>
   1. Disable all user queues / all Slurm partitions: <pre>
ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN; scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN; scontrol update PartitionName=quick State=DRAIN"
</pre>
   1. Delete any remaining jobs in the queue system (see the example sketch at the end of this topic).
   1. Unmount PNFS on the nodes:
      1. umount /pnfs on all nodes, UIs and WNs (from t3admin02): =pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"=
      1. Comment out the /pnfs line in fstab to prevent mounting after reboots: <pre>
for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
</pre>
   1. Stop the puppet runs on the Slurm clients (optional; also covered in the sketch at the end of this topic).
   1. If there is a shutdown of t3nfs02, also umount /work and, on the clients, run: "sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab"
   1. Correspondingly, for big maintenance days: umount /t3home and "sed -i 's/\(t3nfs*\)/# Downtime \1/' /etc/fstab"
   1. Shut down the worker nodes:
      1. Shut down the nodes: <pre>
for n in $(seq 10 59) ; do echo t3wn$n; ssh !root@t3wn$n shutdown -h now ; sleep 1 ; done
</pre>
      1. Check whether all nodes are down: <pre>
for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status ; done
</pre>
   1. Stop !PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Note that !PhEDEx runs as the phedex user and not as root: <pre>
ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
</pre>
   1. Stop the xrootd and cmsd services on t3se01, like =ssh t3se01 systemctl stop xrootd@clustered=
   1. dCache stop steps:
      1. Stop the doors on t3se01 - xrootd, dcap/gsidcap, gsiftp, srm - all visible from "dcache status", like <pre>
ssh t3se01 dcache stop dcap-t3se01Domain
</pre> and stop the xrootd door on t3dcachedb03.
      1. Pools t3fs07-11: =[root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"=
      1. Unmount PNFS from the SE and DB servers: =ssh t3se01 umount /pnfs= ; =ssh t3dcachedb03 umount /pnfs=
      1. t3se01: =ssh t3se01 dcache stop=
      1. Stop the dCache services on the DB server: =ssh t3dcachedb03 dcache stop=
      1. Stop PostgreSQL on the DB server (check the status, then stop): =ssh t3dcachedb03 systemctl stop postgresql-11=
   1. Stop zookeeper on t3zkpr11-13: =for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done=
   1. Stop the BDII: =ssh !root@t3bdii "/etc/init.d/bdii stop"=
   1. Frontier / VMs?: (Derek: I left all VMs running. They will be shut down by the VM team.)
   1. Shut down t3nfs02, t3gpu01-02, t3admin02 and t3fs07-11; on t3fs07-10 power off the server first and afterwards the JBOD.
   1. Shut down the !NetApp system ([[https://kb.netapp.com/app/answers/answer_view/a_id/1031100/~/how-to-power-off-an-e-series-storage-system-][Link]])
      * Make sure no background processes are in operation (SANtricity SMclient GUI)
      * Turn off the controller enclosure
      * (Turn off any additional enclosure)

---+ Start Tier-3
   1. Power on the hardware.
   1. On the VM zookeeper nodes *t3zkpr11-13* check =systemctl status zookeeper= and run =zkcli -server t3zkpr11= on t3zkpr11.
   1. *t3dcachedb03*: check/start postgres with =systemctl start postgresql-11=, check =systemctl status crond= and run =dcache check-config=. Start all main dCache services with =dcache start *Domain= except the doors (currently only one xrootd door is configured on t3dcachedb03).
   1. *t3se01*: start the services except the doors, as listed by =dcache status= (like =dcache start info-t3se01Domain=; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should be started in the same way); the doors (at the moment also configured on *t3se01*) should be started after the pools.
   1. =mount /pnfs= on *t3se01* and *t3dcachedb03*
   1. Start dCache on the pools *t3fs01-11*: =[root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start"= - takes about 15-30 minutes (in case of a hardware issue, switch on the JBOD first and then the server).
   1. Check the dCache logs in /var/log/dcache on the pool, dcachedb and se machines.
   1. Check that the !NetApp is visible from t3fs11: =[root@t3fs11 ~]# multipath -ll=
   1. *t3se01*: start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see =dcache status=), like =dcache start dcap-t3se01Domain=, etc., and the xrootd door on t3dcachedb03.
   1. *t3se01*: check (and start) the xrootd redirector: =systemctl start cmsd@clustered= ; =systemctl start xrootd@clustered=
   1. Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, like =pssh -h node-list-t3/slurm-clients -P "mount | grep pnfs"=
   1. Slurm: on t3slurm run =scontrol update !PartitionName=gpu State=UP= and =scontrol update !PartitionName=wn State=UP=, etc. for all partitions.
   1. When the whole T3 is up, the following useful checks can be performed (see also the sanity-check sketch at the end of this topic):
      * run test-dCacheProtocols from a UI
      * [[MonitoringList][Monitoring List]]
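The shutdown steps "Delete any remaining jobs in the queue system" and "Stop the puppet runs on the Slurm clients" do not list explicit commands. Below is a minimal sketch, assuming the standard Slurm (=squeue=/=scancel=) and Puppet agent CLIs; the =t3slurm= host and the =node-list-t3/slurm-clients= host list are the ones already used on this page, and partition names and paths may need adjusting. <pre>
#!/bin/bash
# Sketch only: the partitions are assumed to be drained already via "scontrol update ... State=DRAIN".

# 1. Cancel any jobs that are still pending or running (all users, all partitions).
ssh t3slurm 'squeue -h -o %A | xargs -r scancel'

# 2. Verify that the queue is really empty before proceeding.
ssh t3slurm 'squeue'

# 3. Optionally disable the puppet agent on all Slurm clients so that
#    configuration runs do not interfere during the downtime.
pssh -h node-list-t3/slurm-clients -P "puppet agent --disable 'T3 downtime'"
</pre>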
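For the final checks after start-up, the following sanity-check sketch only combines commands already used on this page (pnfs mount check on the Slurm clients, Slurm partition states, dCache domain status); the re-enabling of the puppet agent applies only if it was disabled for the downtime as in the sketch above. <pre>
#!/bin/bash
# Post-start-up sanity check (sketch): /pnfs mounts, Slurm partitions, dCache domains.
# Assumes the node-list-t3 host lists on t3admin02.

# /pnfs/psi.ch/cms must be mounted on every UI/WN
pssh -h node-list-t3/slurm-clients -P "mount | grep -q /pnfs/psi.ch/cms && echo pnfs OK || echo pnfs MISSING"

# all Slurm partitions should be back to State=UP
ssh t3slurm 'scontrol show partition | grep -E "PartitionName|State="'

# all dCache domains on the SE, DB and pool nodes should be running
for host in t3se01 t3dcachedb03; do echo "== $host =="; ssh $host dcache status; done
pssh -h node-list-t3/dcache-pools -P "dcache status"

# if the puppet agent was disabled for the downtime, re-enable it
pssh -h node-list-t3/slurm-clients -P "puppet agent --enable"
</pre>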