Shutting down the Tier-3

Before Downtime

  1. Announce the downtime in :
  2. Check the lists of nodes on t3admin02 in the node-list-t3 directories
  3. Stop the snapshot script for /work: on t3nfs02, comment out the following line in /etc/cron.daily/zfssnap so that it reads
    # /opt/zfssnap/zfssnap $PERIOD $MAXSNAPS $BACKUPSERVER >>$LOG 2>&1
  4. When possible, prepare firmware updates for the NetApp and the HP servers (t3nfs01, t3nfs02 and t3admin02), etc., and prepare yum and kernel updates
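
Step 3 can be scripted instead of edited by hand. The following is only a minimal sketch, assuming GNU sed and assuming the active line in the cron script starts with /opt/zfssnap/zfssnap as shown above; the function names disable_zfssnap and enable_zfssnap are hypothetical helpers, not part of the existing setup.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original procedure).

# disable_zfssnap FILE
# Comment out the zfssnap call in FILE. Lines that are already commented
# do not match the pattern, so running it twice is harmless.
# A backup copy is kept as FILE.bak.
disable_zfssnap() {
    local file="$1"
    sed -i.bak 's|^\([[:space:]]*\)\(/opt/zfssnap/zfssnap\)|\1# \2|' "$file"
}

# enable_zfssnap FILE
# Remove the comment marker again after the downtime.
enable_zfssnap() {
    local file="$1"
    sed -i 's|^\([[:space:]]*\)#[[:space:]]*\(/opt/zfssnap/zfssnap\)|\1\2|' "$file"
}
```

On t3nfs02 this would be called as disable_zfssnap /etc/cron.daily/zfssnap before the downtime and enable_zfssnap /etc/cron.daily/zfssnap afterwards.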

Downtime day:

  1. Prevent further logins to the user interfaces. Modify /etc/security/access_users.conf on the user interfaces by commenting out the lines that allow access for all CMS users
    #+ : cms : ALL
    - : ALL : ALL
       
  2. Stop the icinga notifications: on the emonma00 node, in /opt/icinga-config/tier3/objects/tier3_templates.cfg, comment out the members line, like
    define contactgroup{
            contactgroup_name       t3-admins
            alias                   TIER3 Administrators
       #    members          ......................
     
  3. Disable all user queues (all Slurm partitions):
      ssh t3slurm "scontrol update PartitionName=gpu State=DRAIN;scontrol update PartitionName=wn State=DRAIN; scontrol update PartitionName=qgpu State=DRAIN;scontrol update PartitionName=quick State=DRAIN "
       
  4. Delete any remaining jobs in the queue system
  5. Unmount PNFS on the nodes
    1. umount /pnfs on all nodes (UIs and WNs), from t3admin02: pssh -h node-list-t3/slurm-clients -P "umount /pnfs/psi.ch/cms"
    2. Comment out the /pnfs line in /etc/fstab to prevent mounting after reboots
            for n in $(seq 10 59); do echo t3wn$n; ssh t3wn$n "sed -i 's/\(t3dcachedb03:\/pnfs\/psi.ch\/cms.*\)/# COMMENTED FOR DOWNTIME \1/' /etc/fstab"; done
           
    3. Stop the puppet runs on the Slurm clients (optional)
  6. If t3nfs02 is shut down as well, umount /work and, on the clients, run: "sed -i 's/\(t3nfs02.*\)/# Downtime \1/' /etc/fstab"
  7. Correspondingly, for big maintenance days: umount /t3home and run "sed -i 's/\(t3nfs*\)/# Downtime \1/' /etc/fstab"
  8. Shut down the worker nodes
    1. Shut down the nodes
            for n in $(seq 10 59) ; do echo t3wn$n; ssh root@t3wn$n shutdown -h now ; sleep 1 ; done
            
    2. Check whether all nodes are down
      for n in $(seq 10 20) 22 23 $(seq 25 47) $(seq 49 59) ; do node="t3wn$n"; echo -n "$node: "; ipmitool -I lanplus -H rmwn$n -U root -f /root/private/.ipmi-pw chassis power status ; done
      
            
  9. Stop PhEDEx on the t3cmsvobox (since it relies on dcache transfers). Note that PhEDEx runs as the phedex user and not as root.
          ssh phedex@t3cmsvobox /home/phedex/config/T3_CH_PSI/PhEDEx/tools/init.d/phedex_Debug stop
         
  10. Stop the xrootd and cmsd services on t3se01, e.g. ssh t3se01  systemctl stop xrootd@clustered
  11. dcache stop steps:
    1. Stop the doors on t3se01 (dcap/gsidcap, gsiftp, srm, xrootd; all are visible in "dcache status"), e.g.
       ssh  t3se01 dcache stop dcap-t3se01Domain
      and stop the xrootd door on t3dcachedb03
    2. Stop dcache on the pools t3fs07-11: [root@t3admin02 ~]# pssh  -h node-list-t3/dcache-pools -P "dcache stop"
    3. Unmount PNFS from the SE and DB servers: ssh t3se01 umount /pnfs; ssh t3dcachedb03 umount /pnfs
    4. Stop the remaining dcache services on t3se01: ssh t3se01 dcache stop
    5. Stop the dcache services on the DB server: ssh t3dcachedb03 dcache stop
    6. Stop PostgreSQL on the DB server: ssh t3dcachedb03 systemctl stop postgresql-11 (check with systemctl status postgresql-11)
    7. Stop zookeeper on t3zkpr11-13: for n in $(seq 1 3); do ssh t3zkpr1$n systemctl stop zookeeper.service; done
  12. Stop the BDII: ssh root@t3bdii "/etc/init.d/bdii stop"
  13. frontier - VMs ?: (Derek: I left all VMs running. They will be shut down by the VM team)
  14. Shut down t3nfs02, t3gpu01-2, t3admin02 and t3fs07-11; on t3fs07-10, first power off the server and afterwards the JBOD
  15. Shut down the NetApp system (Link)
    • Make sure no background processes are running (SANtricity SMclient GUI)
    • Turn off the controller enclosure
    • (Turn off any additional enclosures)
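
The fstab edits in steps 5.2, 6 and 7 have to be undone before the file systems can be mounted again after the downtime. The following is a minimal sketch of a matching comment/restore pair, assuming GNU sed and assuming a single marker string is used for all disabled lines (the steps above use two different markers); comment_fstab and restore_fstab are hypothetical helper names.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original procedure).

# Marker put in front of fstab lines that are disabled for the downtime
# (same text as in step 5.2; using one marker everywhere is an assumption).
TAG='# COMMENTED FOR DOWNTIME '

# comment_fstab PATTERN FILE
# Prefix every active line containing PATTERN with the marker.
# A line with a '#' before PATTERN (i.e. already commented) is left alone,
# so the function can be run repeatedly.
comment_fstab() {
    local pattern="$1" file="$2"
    sed -i "s|^\([^#]*${pattern}.*\)\$|${TAG}\1|" "$file"
}

# restore_fstab FILE
# Strip the marker again so the entries are mounted at the next boot.
restore_fstab() {
    local file="$1"
    sed -i "s|^${TAG}||" "$file"
}
```

On a client this could be used as comment_fstab 't3dcachedb03:/pnfs/psi.ch/cms' /etc/fstab before the shutdown and restore_fstab /etc/fstab once the storage is back. PATTERN is passed to sed as a regular expression and must not contain the | character.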

Start Tier-3

  1. power on hardware
  2. On the zookeeper VMs t3zkpr11-13, check systemctl status zookeeper, and on t3zkpr11 run zkcli -server t3zkpr11
  3. On t3dcachedb03, check PostgreSQL (systemctl start postgresql-11) and crond (systemctl status crond), and run dcache check-config . Then start all main dcache services with dcache  start *Domain, except the doors (currently only one xrootd door is configured on t3dcachedb03)
  4. On t3se01, start the services other than the doors, as listed by dcache  status (e.g. dcache start info-t3se01Domain; currently dcache-t3se01Domain, pinmanager-t3se01Domain, spacemanager-t3se01Domain and transfermanagers-t3se01Domain should be started the same way); the doors (at the moment also configured on t3se01) should be started after the pools
  5. mount /pnfs on t3se01 and t3dcachedb03
  6. Start dcache on the pools t3fs01-11: [root@t3admin02 ~]# pssh -h node-list-t3/fs-dalco -P "dcache  start" ; this takes about 15-30 minutes (in case of a hardware issue, first switch on the JBOD and then the server)
  7. check dcache logs in /var/log/dcache on pools, dcachedb and se machines
  8. check if NetApp is visible from t3fs11: [root@t3nfs11 ~]# multipath -ll
  9. On t3se01, start the doors (if not done yet): dcap, gsidcap, gsiftp, srm, xrootd (for the list, see dcache status), e.g. dcache start  dcap-t3se01Domain, etc.; also start the xrootd door on t3dcachedb03
  10. On t3se01, check (and start) the xrootd redirector: systemctl start cmsd@clustered ; systemctl start xrootd@clustered
  11. Check on all UIs and WNs/CNs that /pnfs/psi.ch/cms is mounted, e.g. pssh -h node-list-t3/slurm-clients -P "mount |grep pnfs"
  12. Slurm: on t3slurm, run scontrol update PartitionName=gpu State=UP and scontrol update PartitionName=wn State=UP , etc. for all partitions
  13. Once the whole T3 is up, the following checks are useful:
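
One check that follows directly from step 11 is whether /pnfs/psi.ch/cms is really mounted on a node. Reading grep output by eye is error-prone on many nodes, so the sketch below returns an exit status instead. It is only an illustration, assuming a Linux node where /proc/mounts lists the mount point in the second field; is_mounted is a hypothetical helper name and not part of the original checks.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original procedure).

# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Exit status 0 if MOUNTPOINT appears as a mount point in MOUNTS_FILE
# (default: /proc/mounts), 1 otherwise. The second argument exists mainly
# so the function can be tried against a saved copy of a mount table.
is_mounted() {
    local mountpoint="$1" mounts="${2:-/proc/mounts}"
    # Field 2 of each line is the mount point; require an exact match so
    # that /pnfs does not count as /pnfs/psi.ch/cms.
    awk -v mp="$mountpoint" '$2 == mp { found = 1 } END { exit !found }' "$mounts"
}
```

On a node it could be used as: is_mounted /pnfs/psi.ch/cms || echo "$(hostname): /pnfs/psi.ch/cms is not mounted"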

Topic revision: r20 - 2020-06-03 - NinaLoktionova
 