
<!-- keep this as a security measure:
   #uncomment if the subject should only be modifiable by the listed groups 
   # * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   # * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
   #uncomment this if you want the page only be viewable by the listed groups
   # * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup,Main.CMSAdminReaderGroup
-->

---+ Announcing a Tier-3 shutdown
   * Announce the system halt on the [[https://goc.egi.eu/portal/index.php?Page_Type=Site&id=271][GOC Pages]]
   (24h before if it is a *scheduled* downtime)
   * Announce the system halt on =cms-tier3-users@lists.psi.ch=

---+ Shutting down the Tier-3

Temporary note: This sequence is based on a mail by Nina. I (Derek) followed the sequence, adapted it in some places, and added explicit commands where I could.
TODO (Nina):
   * check sequence and provide explicit (and homogeneous as far as possible) commands.
   * the commands must be runnable from t3admin01, not from the laptop. I.e. configuration files that define the names of worker nodes, service nodes, etc. and that help to send parallel commands must be available on t3admin01. I do not care which parallel mechanism is used (cexec or pssh, etc). But the configuration and commands to run must be explicitly in this list and the config must be local to the admin node. 
   * the ssh keys in .ssh/known_hosts on t3admin01 were completely out of date, and seemingly a lot of the WN keys have changed. This prevents working from t3admin01
   * /etc/hosts contains a number of obsolete entries (MeG nodes still inside). Also, we need to define whether the public addresses are kept within that file or only the private ones. At the moment the public addresses are incomplete.
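
One possible way to address the TODO items on node lists and stale host keys is sketched below. It assumes the worker nodes are named t3wn10 to t3wn59, as in the loops further down this page. The function names and the file name =wn-hosts= are hypothetical and do not refer to anything that exists on t3admin01.

```shell
# Sketch only: helper names and file names are illustrative assumptions.

# Print worker node names t3wn<first> .. t3wn<last>, one per line.
# This is the plain host-per-line format that "pssh -h <file>" reads.
gen_wn_hosts() {
    local n
    for n in $(seq "$1" "$2"); do
        echo "t3wn$n"
    done
}

# Refresh the stored host key of every node name read from stdin.
# ssh-keygen -R removes the stale entry, ssh-keyscan fetches the current key.
# NOTE: ssh-keyscan does not authenticate the host. Verify the fingerprints
# by other means before relying on the refreshed file.
refresh_known_hosts() {
    local known_hosts=${1:-$HOME/.ssh/known_hosts}
    local host
    while read -r host; do
        ssh-keygen -R "$host" -f "$known_hosts" >/dev/null 2>&1
        ssh-keyscan -T 5 "$host" >> "$known_hosts" 2>/dev/null
    done
}
```

Usage on the admin node could look like =gen_wn_hosts 10 59 > wn-hosts= followed by =refresh_known_hosts < wn-hosts=. If known_hosts stores the nodes under fully qualified names or IP addresses, the list would have to contain those names instead.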

Sequence:

 *Before* Downtime
   1. Announce the downtime on:
      * t3 user list: cms-tier3-users@lists.psi.ch
      * t3 admin wiki
      * GOCDB: https://goc.egi.eu/portal/index.php?Page_Type=Add_Downtime (to be entered as SCHEDULED, the start must be at least 24 hours in the future)
   1. Stop the snapshot script for /work: on t3nfs02, comment out the zfssnap call in =/etc/cron.daily/zfssnap= so that it reads <br /> =# /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1=
   1. Where possible, prepare the firmware updates for the !NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare the yum and kernel updates
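
The snapshot step above can be scripted instead of edited by hand. The following is a minimal sketch that assumes GNU sed; =comment_out= is a hypothetical helper, not an existing script on t3nfs02.

```shell
# Sketch only: comment_out is a hypothetical helper.
# Prefix every line of FILE ($1) that matches PATTERN ($2) with "# ".
# PATTERN is a basic regular expression and must not contain "|".
# Lines that are already commented out are left untouched, so the
# function can be run twice without stacking comment markers.
# No backup suffix is passed to "sed -i": a backup copy left next to the
# script in /etc/cron.daily could itself be picked up and run by cron.
comment_out() {
    sed -i -e '/^[[:space:]]*#/!s|^\(.*'"$2"'.*\)$|# \1|' "$1"
}
```

On t3nfs02 the call would be =comment_out /etc/cron.daily/zfssnap '/opt/zfssnap/zfssnap'=. Keeping a copy of the original file outside =/etc/cron.daily= beforehand makes it easy to restore the script after the downtime.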

Downtime day:

   1. Prevent further logins to the user interfaces. Modify =/etc/security/access.conf= on the user interfaces by commenting out the lines that allow access for all CMS users and for NX
   <pre>
#+ : cms : :0 129.129.0.0/16
+ : feichtinger loktionova_n : : ALL
#+ : cms : ALL
#+ : nx : ALL
- : ALL : ALL

   </pre>
   1. Stop Nagios
   <pre>
ssh root@t3nagios /etc/init.d/nagios stop 
   </pre> 
   1. Disable all user queues (Slurm partitions) on the WNs:
   <pre>
  ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN;scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN;scontrol update PartitionName=quick State=DRAIN "
   </pre>  
   1. Delete any remaining jobs in the queue system
   1. Unmount PNFS on the nodes
      1. Run =umount /pnfs= on all nodes (UIs, WNs, t3se01 and t3dcachedb03).
      From t3admin02: <pre> pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms" </pre>
      1. Comment out the /pnfs line in =/etc/fstab= to prevent it from being mounted after reboots
      <pre>
      for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
     </pre>
   1. If t3nfs02 is shut down, unmount /work and, on the clients, comment out its fstab entry: =sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab=
   1. Correspondingly, on big maintenance days, unmount /t3home and run =sed -i 's/\(t3nfs.*\)/# Downtime \1/' /etc/fstab=
   1. Shut down the worker nodes
      1. Shut down the nodes
      <pre>
      for n in $(seq 10 59) ; do echo t3wn$n; ssh root@t3wn$n shutdown -h now ; sleep 1 ; done
      </pre>
      1. Check whether all nodes are down
      <pre>
for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/ipmi-pw chassis power status ; done
      </pre>
   1. Stop !PhEDEx on the t3cmsvobox (since it relies on dCache transfers). Note that !PhEDEx runs as the phedex user and not as root.
     <pre>
      ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
     </pre>
   1. Stop xrootd and cmsd on t3se01
     <pre>
       ssh t3se01 service xrootd stop
       ssh t3se01 service cmsd stop
     </pre>
   1. Stop dCache
      1. Stop the doors on t3se01 (dcap, gsidcap, gsiftp, srm, xrootd; all are listed by =dcache status=), for example: <pre> ssh t3se01 dcache stop dcap-t3se01Domain</pre>
      1. Stop the pools on t3fs07-11: =[root@t3admin02 ~]# pssh -h node-list-t3/dcache-pools -P "dcache stop"=
      1. Unmount PNFS from the SE and DB servers
      <pre>
      ssh t3se01 umount /pnfs
      ssh t3dcachedb03 umount /pnfs
      </pre>
      1. Stop the dCache server on t3se01
  <pre>
  ssh t3se01 service dcache-server stop
  </pre>
      1.  Stop dcache services on the DB server  
  <pre>
  ssh t3dcachedb03 service dcache-server stop
  </pre>
      1.  Stop Postgresql on the DB server  
  <pre>
  ssh t3dcachedb03 /etc/init.d/postgresql-9.5 stop
  </pre>
      1. Stop !ZooKeeper on t3zkpr01-03
       <pre>
        for n in $(seq 1 3); do ssh t3zkpr0$n systemctl stop zookeeper.service; done 
       </pre>
   1. Stop the BDII
   <pre>
   ssh root@t3bdii "/etc/init.d/bdii stop"
   </pre>
   1. Frontier and other VMs (?): Derek: I left all VMs running. They will be shut down by the VM team.
   1. Shut down t3nfs01, t3nfs02, t3fs07-11, t3gpu01-2 and t3admin01/02
   1. On t3fs07-10, switch off the server first and the JBOD afterwards
   1. Shut down the !NetApp system ([[https://kb.netapp.com/app/answers/answer_view/a_id/1031100/~/how-to-power-off-an-e-series-storage-system-][Link]])
      * Make sure that no background processes are running (check in the SANtricity SMclient GUI)
      * Turn off the controller enclosure
      * (Turn off any additional enclosures)
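
The power check for the worker nodes prints one line per node, and with about 50 nodes it is easy to overlook one that is still on. The sketch below lists only the nodes that are not confirmed to be off. It reuses the ipmitool call from the check above and relies on ipmitool printing =Chassis Power is off= for a powered-down node. The function names =ipmi_status= and =nodes_not_off= are hypothetical.

```shell
# Sketch only: function names are illustrative assumptions.

# Query the power state of worker node number $1 through its management
# interface rmwn<N>, using the same ipmitool call as in the check above.
ipmi_status() {
    ipmitool -I lanplus -H "rmwn$1" -U root -f /root/private/ipmi-pw chassis power status
}

# Print the name of every worker node (given by number) that is NOT
# confirmed to be powered off. Nodes whose IPMI query fails are listed
# as well, so that they get checked by hand.
nodes_not_off() {
    local n status
    for n in "$@"; do
        status=$(ipmi_status "$n" 2>/dev/null) || status=unknown
        [ "$status" = "Chassis Power is off" ] || echo "t3wn$n"
    done
}
```

Called as =nodes_not_off $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59)=, it prints nothing once all nodes are down.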


---+ Starting the Tier-3
   1. Power on the hardware
   1. On the !ZooKeeper VMs *t3zkpr01-03*, check =systemctl status zookeeper= and =zkcli -server t3zkpr01=
   1. On *t3dcachedb03*, check (and if necessary start) !PostgreSQL with =/etc/init.d/postgresql-9.5 status= (or =start=) and cron with =service crond status= (or =start=), then run =dcache check-config= and start all main dCache services with =dcache start=
   1. On *t3se01*, start all services listed by =dcache status= except the doors, e.g. =dcache start info-t3se01Domain=; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain have to be started the same way. The doors (at the moment also configured on *t3se01*) should be started after the pools
   1. Run =mount /pnfs= on *t3se01* and *t3dcachedb03*
   1. Start dCache on the pools *t3fs01-11*: =[root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache start"= This takes about 15-30 minutes. (In case of a hardware issue, switch on the JBOD first and then the server.)
   1. Check the dCache logs in /var/log/dcache on the pool, dcachedb and SE machines
   1. Check that the !NetApp is visible from t3fs11: =[root@t3nfs11 ~]# multipath -ll=
   1. On *t3se01*, start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see =dcache status=), e.g. =dcache start dcap-t3se01Domain=
   1. On *t3se01*, check (and if necessary start) the xrootd redirector: =service cmsd status= and =service xrootd status=
   1. Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, e.g. =pssh -h node-list-t3/slurm-clients -P "mount |grep pnfs"=
   1. Slurm: on t3slurm, run =scontrol update !PartitionName=gpu State=UP= and =scontrol update !PartitionName=wn State=UP=. The partitions qgpu and quick, which are also drained during the shutdown, need the same command
   1. Run =test-dCacheProtocols= from a UI
   1. Once the whole T3 is up, the following checks are useful:
      * http://t3mon.psi.ch/  and https://icinga.psi.ch/
      *  https://t3nagios.psi.ch/check_mk/index.py?start_url=%2Fcheck_mk%2Fview.py%3Fview_name%3Dhosts%26host%3Dt3bdii02
      * https://t3nagios.psi.ch/check_mk/index.py?start_url=%2Fcheck_mk%2Fview.py%3Fview_name%3Dhost%26host%3Dt3cmsvobox01%26site%3D
      * https://etf-cms-prod.cern.ch/etf/check_mk/view.py?view_name=service&service=org.cms.SRM-VOPut-/cms/Role=production&host=t3se01.psi.ch
      *  phedex: https://cmsweb.cern.ch/phedex/prod/Components::Status
      * batch: on *t3ce02*, check/enable the WNs with =qmod -e *@*= and check the SGE status with =/etc/init.d/sgedbwriter.p6444 status=
      * Slurm check: =sinfo=
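
The =mount | grep pnfs= check in the startup list matches any line containing "pnfs" and has to be read by eye. As a stricter alternative, the sketch below tests for one exact mount point and reports the result through its exit status. =is_mounted= is a hypothetical helper, and the optional second argument exists only so that the function can be tried against a sample file.

```shell
# Sketch only: is_mounted is a hypothetical helper.
# Succeeds (exit status 0) if MOUNTPOINT ($1) is listed as a mount point,
# i.e. as the second field of a line in /proc/mounts, or in the file
# given as optional second argument. Fails (exit status 1) otherwise.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}
```

On a single node this could be used as =is_mounted /pnfs/psi.ch/cms || echo "$(hostname): /pnfs/psi.ch/cms is not mounted"=. To use it through pssh, the function would first have to be made available on the target nodes.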


-- Main.DerekFeichtinger - 2019-01-03