<!--
keep this as a security measure:
#uncomment if the topic should only be modifiable by the listed groups
#   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
#   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page to be viewable only by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->
---+ Shutting down the Tier-3

*Before* the Downtime

   1. Announce the downtime in:
      * the T3 user list: cms-tier3-users@lists.psi.ch
      * the T3 admin wiki
      * GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (must be SCHEDULED; the start must be at least 24 hours in the future)
   1. Check the lists of nodes on *t3admin02* in the *node-list-t3* directory.
   1. Stop the snapshot script for /work: on t3nfs02, comment out this line in =/etc/cron.daily/zfssnap=: <br /> =# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1=
   1. Where possible, prepare firmware updates for the !NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum and kernel updates.

Downtime day:

   1. Prevent further logins to the user interfaces: in =/etc/security/access_users.conf= on the user interfaces, comment out the line that allows access for all CMS users: <pre>
#+ : cms : ALL
- : ALL : ALL
</pre>
   1. Stop Icinga notifications: on the *emonma00* node, comment out the =members= line in =/opt/icinga-config/tier3/objects/tier3_templates.cfg=: <pre>
define contactgroup{
        contactgroup_name       t3-admins
        alias                   TIER3 Administrators
#       members                 ......................
</pre>
   1. Disable all user queues / all Slurm partitions: <pre>
ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN; scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN; scontrol update PartitionName=quick State=DRAIN"
</pre>
   1. Delete any remaining jobs in the queue system.
   1. Unmount PNFS on the nodes:
      1. umount /pnfs on all nodes, UIs and WNs (from t3admin02): =pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"=
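The access_users.conf edit above can be rehearsed offline. A minimal sketch, using a throwaway copy of the two rule lines quoted above (assumption: on the real UIs the file is =/etc/security/access_users.conf= and only the =+ : cms : ALL= line gets commented):

```shell
# Sketch: comment out the "+ : cms : ALL" rule so only the remaining
# deny-all rule applies. Demonstrated on a temporary file, not the real
# /etc/security/access_users.conf.
tmp=$(mktemp)
printf '%s\n' '+ : cms : ALL' '- : ALL : ALL' > "$tmp"
# Prefix the cms allow-rule with '#' (the '&' re-inserts the matched text).
sed -i 's/^+ : cms : ALL/#&/' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

After the edit the first line reads =#+ : cms : ALL=, so pam_access no longer matches it and the =- : ALL : ALL= deny rule takes effect for everyone.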
   1. Comment out the /pnfs line in /etc/fstab to prevent mounting after reboots: <pre>
for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
</pre>
   1. Stop the puppet runs on the Slurm clients (optional).
   1. If there is a shutdown of t3nfs02, also umount /work and, on the clients, run =sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab=
   1. Correspondingly, for big maintenance days: umount /t3home and =sed -i 's/\(t3nfs*\)/# Downtime \1/' /etc/fstab=
   1. Shut down the worker nodes:
      1. Shut down the nodes: <pre>
for n in $(seq 10 59); do echo t3wn$n; ssh root@t3wn$n shutdown -h now; sleep 1; done
</pre>
      1. Check whether all nodes are down: <pre>
for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59); do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status; done
</pre>
   1. Stop !PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Note that !PhEDEx runs as the phedex user and not as root: <pre>
ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
</pre>
   1. Stop the xrootd and cmsd services on t3se01, e.g. =ssh t3se01 systemctl stop xrootd@clustered= (and likewise for =cmsd@clustered=)
   1. dCache stop steps:
      1. Stop the doors on t3se01 - dcap/gsidcap, gsiftp, srm, xrootd - all visible from =dcache status=, e.g. <pre>
ssh t3se01 dcache stop dcap-t3se01Domain
</pre> and stop the xrootd door on t3dcachedb03.
      1. Stop the pools t3fs07-11: =[root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"=
      1. Unmount PNFS from the SE and DB servers: =ssh t3se01 umount /pnfs=; =ssh t3dcachedb03 umount /pnfs=
      1. t3se01: =ssh t3se01 dcache stop=
      1. Stop the dCache services on the DB server: =ssh t3dcachedb03 dcache stop=
      1. Stop Postgresql on the DB server: =ssh t3dcachedb03 systemctl stop postgresql-11= (check first with =systemctl status=)
      1. Stop zookeeper on t3zkpr11-13: =for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done=
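Note that the shutdown loop above iterates over t3wn10-t3wn59 while the power-status check skips t3wn21, t3wn24 and t3wn48. Assuming those gaps are absent or decommissioned machines, a small helper that prints the checked node set keeps the two loops consistent:

```shell
# Hypothetical helper: print the worker-node names used by the
# power-status check above (t3wn10-t3wn59, skipping t3wn21, t3wn24 and
# t3wn48, which that loop omits -- presumably missing nodes).
active_wns() {
    for n in $(seq 10 59); do
        case $n in
            21|24|48) continue ;;   # skipped in the ipmitool loop above
        esac
        echo "t3wn$n"
    done
}
active_wns
```

Both loops could then read =for node in $(active_wns); do ... done=, so the node set is maintained in one place.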
   1. Stop the BDII: =ssh root@t3bdii "/etc/init.d/bdii stop"=
   1. Frontier - VMs?: (Derek: I left all VMs running. They will be shut down by the VM team.)
   1. Shut down t3nfs02, t3gpu01-2, t3admin02 and t3fs07-11; on t3fs07-10, first power off the server and afterwards the JBOD.
   1. Shut down the !NetApp system ([[https://kb.netapp.com/app/answers/answer_view/a_id/1031100/~/how-to-power-off-an-e-series-storage-system-][Link]]):
      * Make sure no background processes are in operation (Santricity SMclient GUI).
      * Turn off the controller enclosure.
      * (Turn off any additional enclosures.)

---+ Start Tier-3

   1. Power on the hardware.
   1. On the VM zookeeper nodes *t3zkpr11-13*, check =systemctl status zookeeper= and run =zkcli -server t3zkpr11= on t3zkpr11.
   1. On *t3dcachedb03*, check Postgres (=systemctl start postgresql-11=), =systemctl status crond= and =dcache check-config=. Start all main dCache services with =dcache start *Domain= besides the doors (currently only one xrootd door is configured on t3dcachedb03).
   1. On *t3se01*, start the services besides the doors listed by =dcache status= (e.g. =dcache start info-t3se01Domain=; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should likewise be started). The doors (at the moment also configured on *t3se01*) should be started after the pools.
   1. =mount /pnfs= on *t3se01* and *t3dcachedb03*.
   1. Start dCache on the pools *t3fs01-11*: =[root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start"= - this takes about 15-30 minutes (in case of a hardware issue, first switch on the JBOD and then the server).
   1. Check the dCache logs in /var/log/dcache on the pool, dcachedb and se machines.
   1. Check that the !NetApp is visible from t3fs11: =[root@t3fs11 ~]# multipath -ll=
   1. On *t3se01*, start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see =dcache status=), e.g. =dcache start dcap-t3se01Domain=, etc., and the xrootd door on t3dcachedb03.
   1. On *t3se01*, check (and start) the xrootd redirector: =systemctl start cmsd@clustered=; =systemctl start xrootd@clustered=
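The shutdown section comments out the /pnfs line in /etc/fstab with a =# COMMENTED FOR DOWNTIME= prefix, and the startup list does not spell out the inverse edit. A sketch of it, demonstrated on a throwaway file with a hypothetical fstab line (on the cluster it would run via ssh/pssh against /etc/fstab on every node, before the mount check below):

```shell
# Sketch of the inverse of the downtime fstab edit: strip the
# "# COMMENTED FOR DOWNTIME " prefix again so /pnfs mounts after reboots.
# The fstab mount options below are a made-up sample, not the real ones.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
# COMMENTED FOR DOWNTIME t3dcachedb03:/pnfs/psi.ch/cms /pnfs/psi.ch/cms nfs defaults 0 0
EOF
# Remove exactly the prefix that the shutdown loop inserted.
sed -i 's/^# COMMENTED FOR DOWNTIME //' "$tmp"
result=$(cat "$tmp")
echo "$result"
rm -f "$tmp"
```

Because the prefix string matches the one inserted during shutdown verbatim, lines commented out for other reasons are left untouched.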
   1. Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, e.g. =pssh -h node-list-t3/slurm-clients -P "mount | grep pnfs"=
   1. Slurm: on t3slurm run =scontrol update !PartitionName=gpu State=UP= and =scontrol update !PartitionName=wn State=UP=, etc., for all partitions.
   1. When the whole T3 is up, the following checks are useful:
      * run test-dCacheProtocols from a UI
      * [[MonitoringList][Monitoring List]]
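The "etc., for all partitions" step above can be spelled out as a single loop over the four partition names that appear in the DRAIN commands of the shutdown section. Shown here as a dry run with =echo= so it can be checked off-cluster; on t3slurm one would drop the =echo=:

```shell
# Dry-run sketch: resume every Slurm partition named on this page.
# Partition names are taken from the shutdown section's DRAIN commands.
out=$(for p in gpu wn qgpu quick; do
    echo "scontrol update PartitionName=$p State=UP"
done)
echo "$out"
```

Keeping the partition names in one list avoids drifting between the DRAIN and UP steps when partitions are added or removed.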
Topic revision: r20 - 2020-06-03 - NinaLoktionova
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.