
CMS Tier-3 Upgrade Planning Page

Yearly Shutdown and dCache upgrade to 1.9.12

Summary

Shut down all services for the yearly PSI shutdown. The systems can stay up, but the network will be down for some time on Jan 12th. Upgrade dCache to 1.9.12 on Jan 14th.

Details

Since some of our switches will get software upgrades, we will also lose all T3 intranet connectivity for a short while. The virtualization infrastructure will be offline for a considerable time over this weekend, so many of our service nodes will be down. We should shut down the UIs and WNs.

Notes

This is the link for the general PSI compute center shutdown on 2013-01-12: https://intranet.psi.ch/AIT/SystemTest2013

  • t3ce01
    • What are the rsync jobs for that seem to continually copy the ganglia RRDs to /dev/shm? They look like ancient, stuck rsync jobs. I deleted them.
    • t3ce01 still runs all ganglia connectors
  • t3mon01
    • Shows high memory and CPU usage. This might be an error condition that arose on Fri 11th, but I also see that ganglia is not set up the way I had prototyped it for CSCS and PSI: the RRD database files should reside in memory and get synced to a disk area every few minutes. That prevents the system from being overwhelmed by the many IOPS needlessly generated by so many RRD files (the way ganglia is organized, there is one file for every metric and host).
  • Minor issues
    • t3admin01
      • Please remove the obsolete files that are clogging up the /root directory.
    • Terminal colors on newer nodes: please disable color output. It is not readable for people who use a white background (e.g. yellow on white is invisible).
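
The in-memory RRD layout described for t3mon01 above can be sketched roughly as follows. This is only an illustration, not the setup prototyped for CSCS and PSI: the two directory paths, the function name sync_rrds, the script location and the 5-minute interval are all assumptions. The idea is that ganglia writes to a tmpfs directory and a cron job copies the tree to disk periodically.

```shell
#!/bin/sh
# Hypothetical paths: adjust to the actual ganglia configuration.
RRD_MEM_DIR="${RRD_MEM_DIR:-/dev/shm/ganglia/rrds}"    # assumption: tmpfs working copy
RRD_DISK_DIR="${RRD_DISK_DIR:-/var/lib/ganglia/rrds}"  # assumption: persistent copy

# Copy the in-memory RRD tree to disk, preserving file attributes.
# Uses rsync when available and falls back to cp -a otherwise.
sync_rrds() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    if command -v rsync >/dev/null 2>&1; then
        rsync -a --delete "$src"/ "$dst"/
    else
        cp -a "$src"/. "$dst"/
    fi
}

# Example cron entry (every 5 minutes), assuming this file is installed as
# /usr/local/sbin/sync-rrds.sh:
#   */5 * * * * root /usr/local/sbin/sync-rrds.sh run
if [ "${1:-}" = "run" ]; then
    sync_rrds "$RRD_MEM_DIR" "$RRD_DISK_DIR"
fi
```

At boot the copy would have to run in the opposite direction (disk to memory) before ganglia starts, otherwise the history is lost on every reboot.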

Shutting down of all services

Remarks added by Fabio
  1. Stop Nagios: ssh root@t3nagios /etc/init.d/nagios stop
  2. Prevent users from logging in by modifying /etc/security/access.conf. Reboot all UIs to get rid of sessions.
    Only t3ui05 and t3ui06 managed to reboot! 
    I had to power off t3ui03 manually, since its ILOM has been unreachable since last summer. Even after a manual power-on, it did not come back up.
    I had to forcibly shut down the other machines and power them on again using the ILOMs.
    
    [root@t3admin01 MAINTENANCE]# cexec ui: uptime
    ************************* ui *************************
    --------- t3ui02---------
    ssh: connect to host t3ui02 port 22: Connection refused
    --------- t3ui03---------
    ssh: connect to host t3ui03 port 22: No route to host
    --------- t3ui04---------
    ssh: connect to host t3ui04 port 22: No route to host
    --------- t3ui05---------
     16:09:05 up 3 min,  0 users,  load average: 0.29, 0.23, 0.10
    --------- t3ui06---------
     16:09:06 up 3 min,  0 users,  load average: 0.20, 0.20, 0.09
    --------- t3ui07---------
    ssh: connect to host t3ui07 port 22: No route to host
    
  3. Kill all surviving SGE jobs.
  4. Shut down SGE on all WNs: cexec wn: /etc/init.d/sgeexecd.p6444 stop and cexec intel: /etc/init.d/sgeexecd.p6444 stop
  5. Shut down SGE master: ssh t3ce "/etc/init.d/sgedbwriter.p6444 stop; /etc/init.d/sgemaster.p6444 stop" Remark: a couple of nohup tail processes depend on the SGE PID, so please leave SGE on and simply start iptables instead: ssh t3ce service iptables start
  6. Stop crond on t3se01, t3se02, t3dcachedb01 and t3dcachedb04. Remark: cron on t3dcachedb01 has been off since the last upgrade; the same holds for the new t3se02 and t3dcachedb02.
  7. Shut down phedex services on t3cmsvobox
  8. Shut down all worker nodes.
  9. Shut down all UIs. Remark: I decided to leave the UIs up (Derek).
  10. Stop the production dCache (StartStopDcache215)
  11. SKIP Stop NFS swshare services (showmount -a for finding clients)
    1. Unmount NFS home from all WNs and UIs. Remember to use both cexec wn: and cexec intel:, or simply add the Intel WNs to the wn: group once and for all.
    2. t3cmsvobox02.psi.ch:/swshare
    3. t3vmui01.psi.ch:/swshare
    4. STALE: t3vm01.psi.ch:/swshare
  12. SKIP Stop NFS shome services (showmount -a for finding clients)
    • all UIs
    • t3vmui01.psi.ch:/shome
    • all WNs
    • t3ce01.psi.ch:/shome
    • t3ce02.psi.ch:/shome
    • t3dcachedb01.psi.ch:/shome
    • t3dcachedb02.psi.ch:/shome/monuser
    • t3dcachedb04.psi.ch:/shome/monuser
    • t3ldap01.psi.ch:/shome
    • t3mon01.psi.ch:/shome
    • t3nagios.psi.ch:/shome/martinelli_f
    • t3nfs01.psi.ch:/shome (outdated)
    • t3se02.psi.ch:/shome/martinelli_f (outdated)
    • t3cmsvobox02.psi.ch:/shome
    • DEAD: t3vm01.psi.ch:/shome
    • DOWN: t3dcachedb03.psi.ch:/shome
    • DEAD: t3vm03.psi.ch:/shome
    • DEAD: t3vmmaster01.psi.ch:/vmshare
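
Steps 11 and 12 rely on showmount -a to find the NFS clients. A minimal sketch of pulling the client host names for one export out of that output, assuming the usual client:/directory line format; the function name clients_for_export is made up for this illustration.

```shell
#!/bin/sh
# Read `showmount -a` output on stdin and print the unique client hosts
# that mount EXPORT itself or a subdirectory of it.
# The header line ("All mount points on ...:") has an empty second field
# and is therefore skipped automatically.
clients_for_export() {
    export_dir=$1
    awk -F: -v dir="$export_dir" '
        $2 == dir || index($2, dir "/") == 1 { print $1 }
    ' | LC_ALL=C sort -u
}

# Usage, run on the NFS server itself:
#   showmount -a | clients_for_export /shome
```

Entries reported by showmount can be stale (as with the DEAD and outdated hosts listed above), so the result is a list of candidates to check, not a guarantee that the client still has the export mounted.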

dCache Upgrade

Assuming both t3se01 and t3dcachedb04 are up but dCache is stopped everywhere (also on t3se02 and t3dcachedb02):
  1. Stop t3se01 because t3se02 will become the new t3se01
  2. Stop crond on t3dcachedb04 if it is found running.
  3. Unfortunately, our daily pg_dumpall backups are probably not consistent, because by design the tool takes the full dump serially rather than in parallel. So take a final full backup by running /var/lib/pgsql_backups/dcache-db-backup.sh as user postgres. This full backup is also copied to t3fs05:/swshare/dcache-postgres-backups/t3dcachedb04, where Fabio's migration script on t3dcachedb02 will later look for it.
  4. Go into t3dcachedb02:/root/DCACHE-1.9.12-MIGRATION and run source ingest.db.backup.sh to ingest the latest full backup from t3fs05.
  5. Convert t3se02 into t3se01. Puppet will do this, but still check that the X.509 certificates and configuration files are OK.
  6. Turn on dCache on t3dcachedb02 and mount /pnfs.
  7. Turn on dCache on t3se02 and mount /pnfs.
  8. Upgrade dCache on each t3fs*. Remember that you have to change the owner of the pool files and directories from root to dcache, REF . Consider the configuration t3fs*:/root/dcache.conf and the layout t3fs*:/root/t3fs*.conf
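
For the ownership change in step 8, a small check like the following could help confirm that nothing under a pool directory was missed. It is only a sketch: the function name list_not_owned_by and the pool path in the usage comment are hypothetical, and the actual pool locations on the t3fs* nodes are not given here.

```shell
#!/bin/sh
# Print every file or directory under DIR that is NOT owned by OWNER.
# OWNER may be a user name or a numeric user ID.
list_not_owned_by() {
    dir=$1
    owner=$2
    find "$dir" ! -user "$owner" -print
}

# Hypothetical usage on a pool node (the pool path is an assumption):
#   chown -R dcache /path/to/pool
#   list_not_owned_by /path/to/pool dcache    # should print nothing
```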

UpgradePlanningForm
Title Yearly Shutdown and dCache upgrade to 1.9.12
Summary Shut down all services for the yearly PSI shutdown. The systems can stay up, but the network will be down for some time on Jan 12th. Upgrade dCache to 1.9.12 on Jan 14th.
Target Date 11. 01. 2013
Topic revision: r13 - 2016-06-08 - FabioMartinelli
 