CMS Tier-3 Upgrade Planning Page

Scheduled PSI downtime

Summary

Yearly maintenance work to be done in the PSI computing center

Details

Before downtime: preparation work

  1. Downtime announcement
    • to users through the mailing list and news section (already done in December)
    • to GOCDB (we were only able to do this on Jan 5th)
  2. Put SGE queues in draining mode according to their runtime limits
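Draining "according to their runtime limits" means that a queue with a longer wall-clock limit has to stop accepting jobs earlier. The following is a minimal sketch of that calculation, not the procedure actually used: the queue names and limits in the usage comment are hypothetical, and it assumes draining is done with `qmod -d`, which disables a queue so that running jobs finish but no new ones are dispatched.

```shell
#!/bin/bash
# Sketch only: queue names and runtime limits are hypothetical examples.

# drain_epoch START_EPOCH LIMIT_HOURS
# Prints the latest time (seconds since the epoch) at which a queue with the
# given runtime limit may still accept jobs and be empty at START_EPOCH.
drain_epoch() {
    local start=$1 limit_hours=$2
    echo $(( start - limit_hours * 3600 ))
}

# print_drain_plan START_EPOCH   (reads "queue limit_hours" lines on stdin)
# Prints one "epoch qmod -d queue" line per queue, earliest deadline first.
print_drain_plan() {
    local start=$1 queue hours
    while read -r queue hours; do
        [ -n "$queue" ] || continue
        printf '%s qmod -d %s\n' "$(drain_epoch "$start" "$hours")" "$queue"
    done | sort -n
}

# Example (GNU date assumed; short.q and long.q are made-up queue names):
#   printf 'short.q 1\nlong.q 48\n' |
#       print_drain_plan "$(date -d '2012-01-06 15:00' +%s)"
```

The plan is only printed, not executed, so it can be reviewed before any queue is actually disabled.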

Downtime start: 2012-01-06, 15:00h

  1. Stop the Nagios process on t3nagios
  2. Bar user access and kill all user sessions
    • set /etc/security/access.conf for all login machines to only allow admin access
    • reboot all UIs
  3. ...
  4. Shortly before 15:30h, switch to the new virtual monitoring node t3mon01
    • Mauro will change one DNS alias entry at 15:30h
  5. Shut down old virtualization environment (decision: WILL LEAVE THEM ON)
    • t3vmmaster01: hosts t3vm03 (test WN) and obsolete t3vmbdii (off anyhow)
    • t3wn08:
      • t3vobox (active CMS Phedex service)
      • t3se02, t3fs12 (dcache testing env)
      • t3jstart (Solaris jumpstart, off)
      • t3vm04 (obsolete Solaris testing machine, off)
  6. Shut down NFS servers t3fs06 and t3fs05 (decision: NOT NECESSARY)
    t3fs06# showmount -a | sort
    192.33.123.200:/shome
    192.33.123.209:/shome
    loghost:/shome
    t3ce.psi.ch:/shome
    t3ce01.psi.ch:/shome
    t3ce02.psi.ch:/shome
    t3cmsvobox02.psi.ch:/shome
    t3dcachedb01.psi.ch:/shome
    t3ldap01.psi.ch:/shome/martinelli_f
    t3mon01.psi.ch:/shome
    t3nagios.psi.ch:/shome/martinelli_f
    t3nfs01.psi.ch:/shome
    t3se02.psi.ch:/shome/martinelli_f
    t3ui01.psi.ch:/shome
    t3ui02.psi.ch:/shome
    t3ui03.psi.ch:/shome
    t3ui04.psi.ch:/shome
    t3ui05.psi.ch:/shome
    t3ui06.psi.ch:/shome
    t3ui07.psi.ch:/shome
    t3vm01.psi.ch:/shome
    t3vm03.psi.ch:/shome
    t3vmmaster01.psi.ch:/vmshare
    t3wn02.psi.ch:/shome
    t3wn03.psi.ch:/shome
    t3wn04.psi.ch:/shome
    t3wn08.psi.ch:/shome
    t3wn08.psi.ch:/vmshare
    t3wn10.psi.ch:/shome
    t3wn11.psi.ch:/shome
    t3wn12.psi.ch:/shome
    t3wn13.psi.ch:/shome
    t3wn14.psi.ch:/shome
    t3wn15.psi.ch:/shome
    t3wn16.psi.ch:/shome
    t3wn17.psi.ch:/shome
    t3wn18.psi.ch:/shome
    t3wn19.psi.ch:/shome
    t3wn20.psi.ch:/shome
    t3wn21.psi.ch:/shome
    t3wn22.psi.ch:/shome
    t3wn23.psi.ch:/shome
    t3wn24.psi.ch:/shome
    t3wn25.psi.ch:/shome
    t3wn26.psi.ch:/shome
    t3wn27.psi.ch:/shome
    t3wn28.psi.ch:/shome
    t3wn29.psi.ch:/shome
    
  7. Power systems off for the yearly maintenance power break in the computing center
    1. The admin node must stay on!
    2. The file servers (NFS + dcache pools) may stay on. We leave them running to make it easier on the disks.
    3. Shut down worker nodes
    4. Shut down UIs
    5. Turn off dcache services (see StartStopDcache215)
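
Step 2 above restricts /etc/security/access.conf to admins. A minimal sketch of what such a file could contain is given here, assuming the login machines evaluate it through pam_access; the account name t3admin in the usage comment is hypothetical and not taken from this page. Rules are matched in order, so the allow lines must come before the final deny-all line.

```shell
#!/bin/bash
# Sketch only: assumes pam_access is active on the login machines.

# render_access_conf ADMIN...
# Prints access.conf rules that allow the listed accounts from anywhere
# and deny everybody else. The first matching rule wins.
render_access_conf() {
    local admin
    for admin in "$@"; do
        printf '+ : %s : ALL\n' "$admin"
    done
    printf '%s\n' '- : ALL : ALL'
}

# Example (t3admin is a made-up account name):
#   render_access_conf root t3admin > /etc/security/access.conf
```

The file is only consulted when a session is opened, so sessions that are already running are not affected; this is why the step also reboots the UIs.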

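The `showmount -a` listing in step 6 can be condensed to show how many clients still mount each export, which helps when judging whether an NFS server can go down. This is a small sketch that only post-processes the command's output; the function name is made up for illustration.

```shell
#!/bin/bash
# Sketch only: summarises "client:/export" lines as printed by showmount -a.

# count_clients_per_export   (reads showmount -a output on stdin)
# Prints "export count" lines, sorted by export path.
count_clients_per_export() {
    awk -F: 'NF >= 2 { count[$2]++ }
             END { for (e in count) print e, count[e] }' | sort
}

# Example, run on the NFS server itself:
#   showmount -a | count_clients_per_export
```
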
Downtime end: starting up the systems

Fabio proposes to exploit this downtime to:

  1. Migrate LDAP from t3admin01 to t3ldap01, because t3admin01 is out of warranty.
    1. On AFS, /etc/ldap.conf has been modified to point to t3ldap01, so a Puppet run will switch the LDAP source on the UIs and WNs.
  2. Migrate Ganglia from t3ce01 to t3mon01; the Ganglia software is already installed on t3mon01.
  3. Apply quotas to /tmp and /scratch on the UIs and WNs
    1. The Puppet profile and software are already prepared; see /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/quota-fs-users-in-ldap. Again, a Puppet run will make the change.
  4. Convert DHCP IPs into fixed IPs?
    1. In the end we decided to skip this step and to apply a default 7-day lease instead, which mitigates the case of a client losing its IP.
  5. Upgrade kernels
  6. Start Nagios process
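
The 7-day default lease from step 4 has to be written in seconds in most DHCP server configurations. The sketch below does the conversion and prints the two matching directives, assuming an ISC dhcpd server (this page does not say which DHCP server is used); setting `max-lease-time` to the same value is an illustrative choice, not something decided here.

```shell
#!/bin/bash
# Sketch only: assumes ISC dhcpd, which this page does not confirm.

# lease_seconds DAYS
# Converts a lease length in days to seconds.
lease_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

# render_lease_config DAYS
# Prints dhcpd.conf lease statements for the given number of days.
render_lease_config() {
    local secs
    secs=$(lease_seconds "$1")
    printf 'default-lease-time %s;\nmax-lease-time %s;\n' "$secs" "$secs"
}

# Example:
#   render_lease_config 7
```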

Upgrades

Server   Kernel  Puppet  /scratch fs
t3ui02   Y       Y       xfs
t3ui03   Y       Y       ext3
t3ui04   Y       Y       ext3
t3ui05   Y       Y       ext3
t3ui06   Y       Y       ext3
t3ui07   Y       Y       xfs
t3wn10   Y       Y       ext3
t3wn11   Y       Y       ext3
t3wn12   Y       Y       ext3
t3wn13   Y       Y       ext3
t3wn14   Y       Y       ext3
t3wn15   Y       Y       ext3
t3wn16   Y       Y       ext3
t3wn17   Y       Y       ext3
t3wn18   Y       Y       ext3
t3wn19   Y       Y       ext3
t3wn20   Y       Y       ext3
t3wn21   Y       Y       ext3
t3wn22   Y       Y       ext3
t3wn23   Y       Y       ext3
t3wn24   Y       Y       ext3
t3wn25   Y       Y       ext3
t3wn26   Y       Y       ext3
t3wn27   Y       Y       ext3
t3wn28   Y       Y       ext3
t3wn29   Y       Y       ext3
t3mon01  Y       Y       n.a.
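
Because the /scratch file system differs between hosts, it can be handy to pull out the hosts of one type, for instance when checking quota settings per file system. The following is a small sketch that filters the table above when it is saved as whitespace-separated text; the function name and the file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Sketch only: expects rows of the form "Server Kernel Puppet fs" on stdin.

# hosts_with_fs FSTYPE
# Prints the server name of every row whose fourth column equals FSTYPE.
# The header row never matches because its fourth field is "/scratch".
hosts_with_fs() {
    awk -v fs="$1" '$4 == fs { print $1 }'
}

# Example (upgrades.txt is a made-up file holding the table):
#   hosts_with_fs xfs < upgrades.txt
```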
UpgradePlanningForm
Title Scheduled PSI downtime
Summary Yearly maintenance work to be done in the PSI computing center
Target Date 06. 01. 2012
Topic revision: r15 - 2016-06-08 - FabioMartinelli