CMS Tier-3 Upgrade Planning Page
Scheduled PSI downtime
Summary
Yearly maintenance work to be done in the PSI computing center
Details
Before downtime: preparation work
- Downtime announcement
- to users through mailing list and news section (done already in December)
- to GOCDB (we were only able to do it on Jan 5th)
- Put SGE queues in draining mode according to their runtime limits
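
Draining "according to runtime limits" means each queue has to stop accepting new jobs early enough that a job started at the last moment can still finish before the downtime begins. A minimal sketch of that arithmetic follows. The queue names and runtime limits are hypothetical placeholders, not taken from the actual SGE configuration; only the downtime start time comes from this page.

```python
from datetime import datetime, timedelta

# Downtime start as stated on this page.
DOWNTIME_START = datetime(2012, 1, 6, 15, 0)

# Hypothetical queue names and runtime limits in hours (assumptions,
# replace with the limits configured on the real cluster).
QUEUE_RUNTIME_LIMITS_H = {
    "short.q": 1.5,
    "all.q": 10,
    "long.q": 96,
}


def drain_times(downtime_start, limits_h):
    """Return {queue: latest time at which the queue must be disabled}."""
    return {
        queue: downtime_start - timedelta(hours=hours)
        for queue, hours in limits_h.items()
    }


if __name__ == "__main__":
    schedule = drain_times(DOWNTIME_START, QUEUE_RUNTIME_LIMITS_H)
    # Queues with the longest limit must be drained first.
    for queue, when in sorted(schedule.items(), key=lambda kv: kv[1]):
        print(f"{when:%Y-%m-%d %H:%M}  disable {queue}")
```

The printed times could then drive the actual disabling of each queue (for example with `qmod -d`, if that matches the local setup).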
Downtime start on 2012-01-06, 15h
- Stop the Nagios process on t3nagios
- Bar user access and kill all user sessions
- set /etc/security/access.conf on all login machines to allow admin access only
- reboot all UIs
- ...
- Shortly before 15:30h switch to the new virtual monitoring node t3mon01
- 1 DNS alias entry will be changed by Mauro at 15:30h
- Shut down old virtualization environment: WILL LEAVE THEM ON
- t3vmmaster01: hosts t3vm03 (test WN) and obsolete t3vmbdii (off anyhow)
- t3wn08:
- t3vobox (active CMS Phedex service)
- t3se02, t3fs12 (dcache testing env)
- t3jstart (solaris jumpstart. off)
- t3vm04 (obsolete solaris testing machine, off)
- Shut down NFS servers t3fs06 and t3fs05: NOT NECESSARY
t3fs06# showmount -a | sort
192.33.123.200:/shome
192.33.123.209:/shome
loghost:/shome
t3ce.psi.ch:/shome
t3ce01.psi.ch:/shome
t3ce02.psi.ch:/shome
t3cmsvobox02.psi.ch:/shome
t3dcachedb01.psi.ch:/shome
t3ldap01.psi.ch:/shome/martinelli_f
t3mon01.psi.ch:/shome
t3nagios.psi.ch:/shome/martinelli_f
t3nfs01.psi.ch:/shome
t3se02.psi.ch:/shome/martinelli_f
t3ui01.psi.ch:/shome
t3ui02.psi.ch:/shome
t3ui03.psi.ch:/shome
t3ui04.psi.ch:/shome
t3ui05.psi.ch:/shome
t3ui06.psi.ch:/shome
t3ui07.psi.ch:/shome
t3vm01.psi.ch:/shome
t3vm03.psi.ch:/shome
t3vmmaster01.psi.ch:/vmshare
t3wn02.psi.ch:/shome
t3wn03.psi.ch:/shome
t3wn04.psi.ch:/shome
t3wn08.psi.ch:/shome
t3wn08.psi.ch:/vmshare
t3wn10.psi.ch:/shome
t3wn11.psi.ch:/shome
t3wn12.psi.ch:/shome
t3wn13.psi.ch:/shome
t3wn14.psi.ch:/shome
t3wn15.psi.ch:/shome
t3wn16.psi.ch:/shome
t3wn17.psi.ch:/shome
t3wn18.psi.ch:/shome
t3wn19.psi.ch:/shome
t3wn20.psi.ch:/shome
t3wn21.psi.ch:/shome
t3wn22.psi.ch:/shome
t3wn23.psi.ch:/shome
t3wn24.psi.ch:/shome
t3wn25.psi.ch:/shome
t3wn26.psi.ch:/shome
t3wn27.psi.ch:/shome
t3wn28.psi.ch:/shome
t3wn29.psi.ch:/shome
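
A listing like the one above is easier to act on when grouped by export. The following is a small sketch that parses `showmount -a` style lines (`client:/path`) and groups clients by top-level export. The function name and the shortened sample are illustrative assumptions. Note that `showmount -a` output can contain stale entries, so it is a hint rather than proof that a client still has the export mounted.

```python
from collections import defaultdict


def clients_by_export(showmount_output):
    """Group 'client:/export/path' lines by their top-level export."""
    groups = defaultdict(set)
    for line in showmount_output.splitlines():
        line = line.strip()
        if ":/" not in line:
            continue  # skip the header line and blank lines
        client, path = line.split(":", 1)
        top_level = "/" + path.strip("/").split("/")[0]
        groups[top_level].add(client)
    return {export: sorted(clients) for export, clients in groups.items()}


# Shortened sample taken from the listing above.
SAMPLE = """\
t3ldap01.psi.ch:/shome/martinelli_f
t3ui01.psi.ch:/shome
t3vmmaster01.psi.ch:/vmshare
t3wn08.psi.ch:/shome
t3wn08.psi.ch:/vmshare
"""

if __name__ == "__main__":
    for export, clients in sorted(clients_by_export(SAMPLE).items()):
        print(f"{export}: {len(clients)} client(s): {', '.join(clients)}")
```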
- Power systems off for the yearly maintenance power break in the compute center
- The admin node must stay on!
- The file servers (NFS + dcache pools) are allowed to stay on! We do that to make it easier on the disks.
- Shut down worker nodes
- Shut down UIs
- Turn off dcache services (q.v. StartStopDcache22)
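
For the step above that restricts /etc/security/access.conf to admin access, a minimal sketch of generating such a file is shown here. It assumes the login machines evaluate this file through pam_access, which applies the first matching rule. The user names are hypothetical; the real admin accounts are not listed on this page.

```python
def render_access_conf(admin_users):
    """Render an access.conf that admits only the given users."""
    lines = ["# Maintenance mode: admin-only login"]
    for user in admin_users:
        # Format is 'permission : users : origins'.
        lines.append(f"+ : {user} : ALL")
    # Deny everyone who did not match an allow rule above.
    lines.append("- : ALL : ALL")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # 'root' and 'admin1' are placeholder account names (assumption).
    print(render_access_conf(["root", "admin1"]), end="")
```

Keeping the deny-all rule last matters, because a rule placed after it would never be reached.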
Downtime end: Starting up of the systems
Fabio proposes to exploit this downtime to:
- migrate LDAP from t3admin01 to t3ldap01 because t3admin01 is out of warranty.
- On AFS, /etc/ldap.conf has been modified to point to t3ldap01, so a Puppet run will swap the LDAP source on UIs and WNs.
- migrate Ganglia from t3ce01 to t3mon01; the Ganglia software is already installed on t3mon01
- Apply quota to /tmp and /scratch on UIs and WNs
- Puppet profile and software already prepared, please look at /afs/psi.ch/service/linux/puppet/var/puppet/environments/DerekDevelopment/modules/quota-fs-users-in-ldap; again, a Puppet run will make the change.
- Convert DHCP IPs to fixed IPs?
- In the end we decided to apply a default 7-day lease instead, which lets us skip this step and mitigates the risk of a client losing its IP.
- Upgrade kernels
- Start Nagios process
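
The reasoning behind the 7-day DHCP lease can be made concrete with a little arithmetic. The sketch below assumes the default DHCP timers from RFC 2131 (renewal at 50% of the lease, rebinding at 87.5%); a site can override these, so the fractions are assumptions about the local setup. It also assumes a client renewed normally until the DHCP server became unreachable.

```python
from datetime import timedelta


def lease_timers(lease, t1_fraction=0.5, t2_fraction=0.875):
    """Return the renewal (T1), rebinding (T2) and expiry offsets of a lease."""
    return {
        "renew (T1)": lease * t1_fraction,
        "rebind (T2)": lease * t2_fraction,
        "expire": lease,
    }


def max_safe_outage(lease, t1_fraction=0.5):
    """Lease time left in the worst case, i.e. when the DHCP server
    becomes unreachable just before the client's renewal at T1."""
    return lease * (1 - t1_fraction)


if __name__ == "__main__":
    lease = timedelta(days=7)
    for name, offset in lease_timers(lease).items():
        print(f"{name:12s} after {offset}")
    print(f"worst-case time before a client loses its IP: {max_safe_outage(lease)}")
```

Under these assumptions, a client keeps its address through any DHCP server outage shorter than 3.5 days.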
Upgrades
| Server | Kernel | Puppet | /scratch fs |
| t3ui02 | Y | Y | xfs |
| t3ui03 | Y | Y | ext3 |
| t3ui04 | Y | Y | ext3 |
| t3ui05 | Y | Y | ext3 |
| t3ui06 | Y | Y | ext3 |
| t3ui07 | Y | Y | xfs |
| t3wn10 | Y | Y | ext3 |
| t3wn11 | Y | Y | ext3 |
| t3wn12 | Y | Y | ext3 |
| t3wn13 | Y | Y | ext3 |
| t3wn14 | Y | Y | ext3 |
| t3wn15 | Y | Y | ext3 |
| t3wn16 | Y | Y | ext3 |
| t3wn17 | Y | Y | ext3 |
| t3wn18 | Y | Y | ext3 |
| t3wn19 | Y | Y | ext3 |
| t3wn20 | Y | Y | ext3 |
| t3wn21 | Y | Y | ext3 |
| t3wn22 | Y | Y | ext3 |
| t3wn23 | Y | Y | ext3 |
| t3wn24 | Y | Y | ext3 |
| t3wn25 | Y | Y | ext3 |
| t3wn26 | Y | Y | ext3 |
| t3wn27 | Y | Y | ext3 |
| t3wn28 | Y | Y | ext3 |
| t3wn29 | Y | Y | ext3 |
| t3mon01 | Y | Y | n.a. |
Topic revision: r12 - 2013-05-13 - FabioMartinelli