<!--
keep this as a security measure:
#uncomment if the topic should only be modifiable by the listed groups
#   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
#   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ Shutting down the Tier-3

*Before* the Downtime

   1. Announce the downtime in:
      * the T3 user list: cms-tier3-users@lists.psi.ch
      * the T3 admin wiki
      * GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (must be SCHEDULED; the start must be at least 24 hours in the future)
   1. Check the lists of nodes on *t3admin02* in the *node-list-t3* directory.
   1. Stop the snapshot script for /work: on t3nfs02, comment out this line in =/etc/cron.daily/zfssnap=: <br /> =# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1=
   1. Where possible, prepare firmware updates for the !NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum and kernel updates.

Downtime day:

   1. Prevent further logins to the user interfaces: in =/etc/security/access_users.conf= on the user interfaces, comment out the line that allows access for all CMS users: <pre>
#+ : cms : ALL
- : ALL : ALL
</pre>
   1. Stop Icinga notifications: on the *emonma00* node, comment out the =members= line in =/opt/icinga-config/tier3/objects/tier3_templates.cfg=: <pre>
define contactgroup{
        contactgroup_name       t3-admins
        alias                   TIER3 Administrators
#       members                 ......................
</pre>
   1. Disable all user queues / all Slurm partitions: <pre>
ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN; scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN; scontrol update PartitionName=quick State=DRAIN"
</pre>
   1. Delete any remaining jobs in the queue system.
   1. Unmount PNFS on the nodes:
      1. umount /pnfs on all nodes, UIs and WNs (from t3admin02): =pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"=
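The access_users.conf edit above can be rehearsed offline. A minimal sketch, using a throwaway copy of the two rule lines quoted above (assumption: on the real UIs the file is =/etc/security/access_users.conf= and only the =+ : cms : ALL= line gets commented):

```shell
# Sketch: comment out the "+ : cms : ALL" rule so only the remaining
# deny-all rule applies. Demonstrated on a temporary file, not the real
# /etc/security/access_users.conf.
tmp=$(mktemp)
printf '%s\n' '+ : cms : ALL' '- : ALL : ALL' > "$tmp"
# Prefix the cms allow-rule with '#' (the '&' re-inserts the matched text).
sed -i 's/^+ : cms : ALL/#&/' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

After the edit the first line reads =#+ : cms : ALL=, so pam_access no longer matches it and the =- : ALL : ALL= deny rule takes effect for everyone.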
   1. Comment out the /pnfs line in /etc/fstab to prevent mounting after reboots: <pre>
for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
</pre>
   1. Stop the puppet runs on the Slurm clients (optional).
   1. If there is a shutdown of t3nfs02, also umount /work and, on the clients, run =sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab=
   1. Correspondingly, for big maintenance days: umount /t3home and =sed -i 's/\(t3nfs*\)/# Downtime \1/' /etc/fstab=
   1. Shut down the worker nodes:
      1. Shut down the nodes: <pre>
for n in $(seq 10 59); do echo t3wn$n; ssh root@t3wn$n shutdown -h now; sleep 1; done
</pre>
      1. Check whether all nodes are down: <pre>
for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59); do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status; done
</pre>
   1. Stop !PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Note that !PhEDEx runs as the phedex user and not as root: <pre>
ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
</pre>
   1. Stop the xrootd and cmsd services on t3se01, e.g. =ssh t3se01 systemctl stop xrootd@clustered= (and likewise for =cmsd@clustered=)
   1. dCache stop steps:
      1. Stop the doors on t3se01 - dcap/gsidcap, gsiftp, srm, xrootd - all visible from =dcache status=, e.g. <pre>
ssh t3se01 dcache stop dcap-t3se01Domain
</pre> and stop the xrootd door on t3dcachedb03.
      1. Stop the pools t3fs07-11: =[root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"=
      1. Unmount PNFS from the SE and DB servers: =ssh t3se01 umount /pnfs=; =ssh t3dcachedb03 umount /pnfs=
      1. t3se01: =ssh t3se01 dcache stop=
      1. Stop the dCache services on the DB server: =ssh t3dcachedb03 dcache stop=
      1. Stop Postgresql on the DB server: =ssh t3dcachedb03 systemctl stop postgresql-11= (check first with =systemctl status=)
      1. Stop zookeeper on t3zkpr11-13: =for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done=
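Note that the shutdown loop above iterates over t3wn10-t3wn59 while the power-status check skips t3wn21, t3wn24 and t3wn48. Assuming those gaps are absent or decommissioned machines, a small helper that prints the checked node set keeps the two loops consistent:

```shell
# Hypothetical helper: print the worker-node names used by the
# power-status check above (t3wn10-t3wn59, skipping t3wn21, t3wn24 and
# t3wn48, which that loop omits -- presumably missing nodes).
active_wns() {
    for n in $(seq 10 59); do
        case $n in
            21|24|48) continue ;;   # skipped in the ipmitool loop above
        esac
        echo "t3wn$n"
    done
}
active_wns
```

Both loops could then read =for node in $(active_wns); do ... done=, so the node set is maintained in one place.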
   1. Stop the BDII: =ssh root@t3bdii "/etc/init.d/bdii stop"=
   1. Frontier - VMs?: (Derek: I left all VMs running. They will be shut down by the VM team.)
   1. Shut down t3nfs02, t3gpu01-2, t3admin02 and t3fs07-11; on t3fs07-10, first power off the server and afterwards the JBOD.
   1. Shut down the !NetApp system ([[https://kb.netapp.com/app/answers/answer_view/a_id/1031100/~/how-to-power-off-an-e-series-storage-system-][Link]]):
      * Make sure no background processes are in operation (Santricity SMclient GUI).
      * Turn off the controller enclosure.
      * (Turn off any additional enclosures.)

---+ Start Tier-3

   1. Power on the hardware.
   1. On the VM zookeeper nodes *t3zkpr11-13*, check =systemctl status zookeeper= and run =zkcli -server t3zkpr11= on t3zkpr11.
   1. On *t3dcachedb03*, check Postgres (=systemctl start postgresql-11=), =systemctl status crond= and =dcache check-config=. Start all main dCache services with =dcache start *Domain= besides the doors (currently only one xrootd door is configured on t3dcachedb03).
   1. On *t3se01*, start the services besides the doors listed by =dcache status= (e.g. =dcache start info-t3se01Domain=; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should likewise be started). The doors (at the moment also configured on *t3se01*) should be started after the pools.
   1. =mount /pnfs= on *t3se01* and *t3dcachedb03*.
   1. Start dCache on the pools *t3fs01-11*: =[root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start"= - this takes about 15-30 minutes (in case of a hardware issue, first switch on the JBOD and then the server).
   1. Check the dCache logs in /var/log/dcache on the pool, dcachedb and se machines.
   1. Check that the !NetApp is visible from t3fs11: =[root@t3fs11 ~]# multipath -ll=
   1. On *t3se01*, start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see =dcache status=), e.g. =dcache start dcap-t3se01Domain=, etc., and the xrootd door on t3dcachedb03.
   1. On *t3se01*, check (and start) the xrootd redirector: =systemctl start cmsd@clustered=; =systemctl start xrootd@clustered=
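The shutdown section comments out the /pnfs line in /etc/fstab with a =# COMMENTED FOR DOWNTIME= prefix, and the startup list does not spell out the inverse edit. A sketch of it, demonstrated on a throwaway file with a hypothetical fstab line (on the cluster it would run via ssh/pssh against /etc/fstab on every node, before the mount check below):

```shell
# Sketch of the inverse of the downtime fstab edit: strip the
# "# COMMENTED FOR DOWNTIME " prefix again so /pnfs mounts after reboots.
# The fstab mount options below are a made-up sample, not the real ones.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# COMMENTED FOR DOWNTIME t3dcachedb03:/pnfs/psi.ch/cms /pnfs/psi.ch/cms nfs defaults 0 0
EOF
# Remove exactly the prefix that the shutdown loop inserted.
sed -i 's/^# COMMENTED FOR DOWNTIME //' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

Because the prefix string matches the one inserted during shutdown verbatim, lines commented out for other reasons are left untouched.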
   1. Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, e.g. =pssh -h node-list-t3/slurm-clients -P "mount | grep pnfs"=
   1. Slurm: on t3slurm run =scontrol update !PartitionName=gpu State=UP= and =scontrol update !PartitionName=wn State=UP=, etc., for all partitions.
   1. When the whole T3 is up, the following checks are useful:
      * run test-dCacheProtocols from a UI
      * [[MonitoringList][Monitoring List]]
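The "etc., for all partitions" step above can be spelled out as a single loop over the four partition names that appear in the DRAIN commands of the shutdown section. Shown here as a dry run with =echo= so it can be checked off-cluster; on t3slurm one would drop the =echo=:

```shell
# Dry-run sketch: resume every Slurm partition named on this page.
# Partition names are taken from the shutdown section's DRAIN commands.
out=$(for p in gpu wn qgpu quick; do
    echo "scontrol update PartitionName=$p State=UP"
done)
echo "$out"
```

Keeping the partition names in one list avoids drifting between the DRAIN and UP steps when partitions are added or removed.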
Topic revision: r20 - 2020-06-03 - NinaLoktionova
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.