CMS Tier-3 Upgrade Planning Page

phase B restructuring

Summary

dCache upgrade to 1.9.5-16. Test evacuation of the old X4500 pools. Investigate the slow t3fs05 transfer speeds.

dCache upgrade to 1.9.5-16

Stop PhEDEx
UI: Prevent user logins, then reboot to get rid of all logged-in users
We may want to kill all running jobs on the nodes (alternatively, we can just let them run and fail)
Stop dCache

Make a backup of the postgres DB

time pg_dumpall -U postgres > dcachedb01-dbbackup-20100311.bup

real    0m18.158s
user    0m0.614s
sys     0m3.094s
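Before moving the old installation away, it may be worth checking that the dump is usable. A minimal sketch, assuming the plain-text format of pg_dumpall with its usual header and trailer comment lines; the function name `check_dump` is made up for this page:

```shell
#!/bin/sh
# check_dump FILE: rough sanity check of a plain-text pg_dumpall file.
# Assumption: pg_dumpall writes a "PostgreSQL database cluster dump" comment
# at the top and a "... dump complete" comment at the end of the file.
check_dump() {
    dump_file=$1
    # The file must exist and must not be empty.
    if [ ! -s "$dump_file" ]; then
        echo "FAIL: $dump_file is missing or empty" >&2
        return 1
    fi
    # Header comment near the top of the dump.
    if ! head -n 5 "$dump_file" | grep 'PostgreSQL database cluster dump' >/dev/null; then
        echo "FAIL: $dump_file does not start like a pg_dumpall dump" >&2
        return 1
    fi
    # Trailer comment; if it is missing, the dump was probably cut short.
    if ! tail -n 5 "$dump_file" | grep 'PostgreSQL database cluster dump complete' >/dev/null; then
        echo "FAIL: $dump_file looks truncated" >&2
        return 1
    fi
    echo "OK: $dump_file"
}

# Usage (file name taken from the command above):
#   check_dump dcachedb01-dbbackup-20100311.bup
```

This only shows that the dump ran to completion; it does not prove that a restore would succeed.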

Make a backup of the current installation: one for t3se01 and one for a Thumper (t3fs05)

ssh t3se01 mv /opt/d-cache /opt/d-cache-1.9.2-5
ssh t3dcachedb01 mv /opt/d-cache /opt/d-cache-1.9.2-5
cexec fs: mv /opt/d-cache /opt/d-cache-1.9.2-5

t3se01 upgrade
1. install the RPM
2. Put the configuration files in place and check them
  - /opt/d-cache/config/dCacheSetup
  - /opt/d-cache/etc/node_config
3. DO NOT FORGET TO RUN install.sh!!!
t3dcachedb01 upgrade
1. install the RPM
2. Put the configuration files in place and check them
  - /opt/d-cache/config/dCacheSetup
  - /opt/d-cache/etc/node_config
  - /opt/d-cache/etc/dcachesrm-gplazma.policy
  - /etc/grid-security/grid-vorolemap
  - /etc/grid-security/storage-authzdb
  - /opt/d-cache/etc/glue-1.3.xml (Info System)
3. DO NOT FORGET TO RUN install.sh!!!
4. Check whether the pools are correctly found by the init scripts
```
cexec fs: /opt/d-cache/bin/dcache pool ls 
```
Fileserver upgrade
1. Install the Solaris packages on the File servers
```
cexec fs: pkgrm -n dCache
ssh t3fs01 pkgadd -d dcache-server-1.9.5-16.pkg   # -d is needed to install from a package file; regrettably needs one interactive answer on each server
...
```
2. Put the fileserver configuration files in place and check them
  - /opt/d-cache/config/dCacheSetup (note that dCacheSetup may require a different Java location on the file servers)
  - /opt/d-cache/etc/node_config
3. DO NOT FORGET TO RUN install.sh!!!
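Since the Java location in dCacheSetup may differ on the file servers, a quick check on each node can catch a wrong path before dCache is started. A rough sketch, assuming the path is set on a line of the form `java="/path/to/java"` in dCacheSetup; the function names are hypothetical:

```shell
#!/bin/sh
# java_from_setup FILE: print the value of the last uncommented java= line.
# Assumption: dCacheSetup sets the JVM path as java="/path/to/java".
java_from_setup() {
    sed -n 's/^[[:space:]]*java=//p' "$1" | tail -n 1 | tr -d '"'
}

# check_java FILE: succeed only if the configured java binary is executable.
check_java() {
    java_bin=$(java_from_setup "$1")
    if [ -n "$java_bin" ] && [ -x "$java_bin" ]; then
        echo "ok: $java_bin"
    else
        echo "FAIL: java='$java_bin' from $1 is not executable" >&2
        return 1
    fi
}

# Usage on a node:
#   check_java /opt/d-cache/config/dCacheSetup
```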

Confirm that the versions are correct everywhere

ssh t3se01 /opt/d-cache/bin/dcache version
ssh t3dcachedb01 /opt/d-cache/bin/dcache version
cexec fs: /opt/d-cache/bin/dcache version
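The version check can also be scripted so that a forgotten node stands out. A minimal sketch, assuming that the output of `dcache version` contains the version string (here 1.9.5-16) and that the hosts are reachable by ssh; the function names and the file server list in the usage line are placeholders:

```shell
#!/bin/sh
# remote_version HOST: print what HOST reports as its dCache version.
remote_version() {
    ssh "$1" /opt/d-cache/bin/dcache version
}

# check_versions EXPECTED HOST...: report hosts whose version output does
# not contain EXPECTED. Returns non-zero if any host deviates.
check_versions() {
    expected=$1
    shift
    rc=0
    for host in "$@"; do
        out=$(remote_version "$host" 2>&1) || out="no answer: $out"
        case $out in
            *"$expected"*) echo "ok   $host" ;;
            *)             echo "FAIL $host: $out"; rc=1 ;;
        esac
    done
    return $rc
}

# Usage (file server names are examples only):
#   check_versions 1.9.5-16 t3se01 t3dcachedb01 t3fs01 t3fs05
```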

Start dCache on t3se01 and t3dcachedb01
- Check whether the cells come up correctly
Start dCache on a single pool
- Check services using our testing script
Start remaining pools
Check whether the Info system is still running correctly: the format of the /opt/d-cache/etc/glue-1.3.xml file has changed quite a bit, so a new one was prepared.

test on t3fs05 filesystem to find bottleneck

Pool migration from t3fs05 to a new Thor

List of open tasks

Virtual machine infrastructure
- install a semi-permanent vmware-server host (t3wn08 has one broken NIC port, so we should probably free this machine for repairs)
- test running VMs over NFS with the images residing on ZFS on a thumper (t3fs06?)
- migrate all virtual machines to this new installation (requires DOWNTIME)
File servers and dCache
- find solution to upgrade problem
- prepare new Thors for dcache
  - define a standard configuration procedure in which Puppet takes over most of the configuration. We cannot do a full Puppet host install, since JumpStart and Puppet are not coupled
  - set up a raidz2 ZFS structure for the pools
  - install dCache and bring the Thors online with writes disabled
- migrate the data to the new pools to free up servers
- Reinstall the old thumpers through JumpStart (standard config + Puppet), so that we have the same Solaris version and a raidz2 configuration everywhere
- Migrate dcache to Chimera
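
As an illustration for the raidz2 task above, a small sketch that only prints a `zpool create` command with the disks grouped into raidz2 vdevs of equal width, so the layout can be reviewed before anything is run. The pool name, vdev width and disk names are hypothetical, since the actual Thor layout is not decided on this page. It needs a POSIX shell or bash for the arithmetic:

```shell
#!/bin/sh
# build_zpool_cmd POOL WIDTH DISK...: print (do not run) a zpool create
# command that groups the disks into raidz2 vdevs of WIDTH disks each.
build_zpool_cmd() {
    pool=$1
    width=$2
    shift 2
    if [ "$width" -lt 1 ] || [ "$#" -eq 0 ] || [ $(( $# % width )) -ne 0 ]; then
        echo "error: need a non-empty disk list that is a multiple of $width" >&2
        return 1
    fi
    cmd="zpool create $pool"
    i=0
    for disk in "$@"; do
        # Start a new raidz2 vdev every WIDTH disks.
        if [ $(( i % width )) -eq 0 ]; then
            cmd="$cmd raidz2"
        fi
        cmd="$cmd $disk"
        i=$(( i + 1 ))
    done
    echo "$cmd"
}

# Usage (example names only):
#   build_zpool_cmd data1 6 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0
```
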
Home directories
- implement daily snapshots of shome on t3fs06 (cron based script, delete older snapshots)
- implement incremental snapshot transfers to a backup server
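
A possible shape for the two snapshot tasks above, as a sketch only. The dataset name, the `daily-` snapshot prefix, the backup host and the target dataset are assumptions, and the `date` format must be supported by the local `date`. The selection of snapshots to delete is kept separate from the `zfs` calls so that it can be tested without a pool:

```shell
#!/bin/sh
# snapshots_to_delete KEEP: read snapshot names from stdin (oldest first)
# and print all but the newest KEEP of them.
snapshots_to_delete() {
    keep=$1
    list=$(cat)
    total=$(printf '%s\n' "$list" | awk 'NF { n++ } END { print n + 0 }')
    excess=$(( total - keep ))
    if [ "$excess" -gt 0 ]; then
        printf '%s\n' "$list" | head -n "$excess"
    fi
}

# rotate_snapshots DATASET KEEP: take today's snapshot, then prune old ones.
rotate_snapshots() {
    fs=$1
    keep=$2
    zfs snapshot "$fs@daily-$(date +%Y%m%d)"
    zfs list -H -t snapshot -o name -s creation -r "$fs" |
        grep "^$fs@daily-" |
        snapshots_to_delete "$keep" |
        while read -r snap; do
            zfs destroy "$snap"
        done
}

# send_incremental DATASET OLDSNAP NEWSNAP HOST TARGET: ship the changes
# between two snapshots to a dataset on a backup server.
send_incremental() {
    zfs send -i "$1@$2" "$1@$3" | ssh "$4" zfs receive "$5"
}

# Usage from cron (all names are placeholders):
#   rotate_snapshots pool/shome 14
#   send_incremental pool/shome daily-20100314 daily-20100315 backuphost backup/shome
```
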
Services
- convert the VM t3ui02 to a real physical machine (let's take t3wn01)
- Set up a new VM for the VO-Box (mostly PhEDEx... I think that frontier should stay on a physical host with a local HD)
Lesser priority
- LDAP directory service
  - should we move that onto a VM? Is the admin host indeed a good place for this? What about backup and failback? This is a critical system
  - make use of the new extension fields, so that they can be used in automated scripts
- Attach system to NAGIOS (maybe a project for a practicum student?)

UpgradePlanningForm
Title	phase B restructuring
Summary	dcache upgrade to 1.9.5-16. Test evacuation of old X4500 pools. Research slow t3fs05 transfer speeds
Target Date	11-12. 03. 2010

Topic revision: r4 - 2010-03-15 - DerekFeichtinger
