CMS Tier-3 Upgrade Planning Page

phase B restructuring

Summary

dCache upgrade to 1.9.5-16. Test evacuation of old X4500 pools. Research slow t3fs05 transfer speeds.

dCache upgrade to 1.9.5-16

  1. Stop PhEDEx
  2. UI: prevent user logins and reboot to clear all logged-in users
  3. We may want to kill all running jobs on the nodes (but we can also just let them run and fail)
  4. Stop dCache
  5. Make a backup of the postgres DB
    time pg_dumpall -U postgres > dcachedb01-dbbackup-20100311.bup
    
    real    0m18.158s
    user    0m0.614s
    sys     0m3.094s
    
  6. Make a backup of the current installation: one for t3se01 and one for a Thumper (t3fs05)
    ssh t3se01 mv /opt/d-cache /opt/d-cache-1.9.2-5
    ssh t3dcachedb01 mv /opt/d-cache /opt/d-cache-1.9.2-5
    cexec fs: mv /opt/d-cache /opt/d-cache-1.9.2-5
    
  7. t3se01 upgrade
    1. install the RPM
    2. Put the configuration files in place and check them
      • /opt/d-cache/config/dCacheSetup
      • /opt/d-cache/etc/node_config
    3. DO NOT FORGET TO RUN install.sh!!!
  8. t3dcachedb01 upgrade
    1. install the RPM
    2. Put the configuration files in place and check them
      • /opt/d-cache/config/dCacheSetup
      • /opt/d-cache/etc/node_config
      • /opt/d-cache/etc/dcachesrm-gplazma.policy
      • /etc/grid-security/grid-vorolemap
      • /etc/grid-security/storage-authzdb
      • /opt/d-cache/etc/glue-1.3.xml (Info System)
    3. DO NOT FORGET TO RUN install.sh!!!
    4. Check whether the pools are correctly found by the init scripts
      cexec fs: /opt/d-cache/bin/dcache pool ls 
  9. Fileserver upgrade
    1. Install the Solaris packages on the file servers
      cexec fs: pkgrm -n dCache
      ssh t3fs01 pkgadd -n dcache-server-1.9.5-16.pkg   # regrettably needs one interactive answer on each server
      ...
      
    2. Put the fileserver configuration files in place and check them
      • /opt/d-cache/config/dCacheSetup (note that dCacheSetup may require a different Java location on the file servers)
      • /opt/d-cache/etc/node_config
    3. DO NOT FORGET TO RUN install.sh!!!
  10. Confirm that the versions are correct everywhere
    ssh t3se01 /opt/d-cache/bin/dcache version
    ssh t3dcachedb01 /opt/d-cache/bin/dcache version
    cexec fs: /opt/d-cache/bin/dcache version
    
  11. Start dCache on t3se01 and t3dcachedb01
    • Check whether the cells come up correctly
  12. Start dCache on a single pool
    • Check services using our testing script
  13. Start remaining pools
  14. Check whether the Info System still runs correctly: the format of the /opt/d-cache/etc/glue-1.3.xml file has changed considerably. A new version of the file has been prepared.
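
The version check in step 10 could be scripted so that a host still on the old release is flagged before any service is started. The following is only a sketch: report_versions is a hypothetical helper (not part of dCache), and it assumes that the output of "dcache version" contains the release string on a single line, which has not been verified here for 1.9.5-16.

```shell
#!/bin/sh
# Hypothetical helper, not part of dCache.
# Reads lines of the form "<host> <version output>" on stdin and checks that
# each version output contains the expected release string.
# Assumption: `dcache version` prints the release on a single line.
report_versions() {
    expected=$1
    rc=0
    while read -r host output; do
        case "$output" in
            *"$expected"*) echo "OK $host" ;;
            *)             echo "MISMATCH $host: $output"; rc=1 ;;
        esac
    done
    return "$rc"
}

# Example use on the admin node (host names taken from the plan above):
#   for h in t3se01 t3dcachedb01; do
#       echo "$h $(ssh "$h" /opt/d-cache/bin/dcache version)"
#   done | report_versions 1.9.5-16
```

The function returns a non-zero status if any host does not report the expected release, so it can gate the start of the services in steps 11 to 13.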

Test on t3fs05 filesystem to find bottleneck
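
One possible first step, sketched here under assumptions that are not stated on this page, is to time a large sequential write directly on the t3fs05 pool filesystem and compare the resulting rate with what dCache transfers achieve. A large gap would suggest that the bottleneck lies in the network or dCache layer rather than in the disks. mb_per_s is a hypothetical helper, and the test file path is a placeholder.

```shell
#!/bin/sh
# mb_per_s BYTES SECONDS: print the throughput in MB/s (1 MB = 10^6 bytes).
# Hypothetical helper for turning a timed transfer into a comparable rate.
# Note: on Solaris the legacy /usr/bin/awk may not accept -v; nawk or
# /usr/xpg4/bin/awk would be needed there (assumption, not checked on t3fs05).
mb_per_s() {
    LC_ALL=C awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# Example (placeholder path; run on the file server itself):
#   time dd if=/dev/zero of=/path/to/pool/ddtest bs=1024k count=4096
#   mb_per_s 4294967296 <elapsed seconds>
# Caveat: if ZFS compression is enabled, zeros compress away and the
# result is not representative; use incompressible data in that case.
```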

Pool migration from t3fs05 to a new Thor

List of open tasks

  • Virtual machine infrastructure
    • install a semi-permanent vmware-server host (t3wn08 has one broken NIC port; we should probably free this machine for repairs)
    • test running VMs over NFS with the images residing on ZFS on a thumper (t3fs06?)
    • migrate all virtual machines to this new installation DOWNTIME
  • File servers and dCache
    • find solution to upgrade problem DONE
    • prepare new Thors for dCache
      • define a standard configuration procedure in which Puppet takes over most of the configuration. We cannot do a full Puppet host install, since JumpStart and Puppet are not coupled
      • set up a raidz2 ZFS structure for the pools
      • install dCache and bring the Thors online with writes disabled
    • migrate the data to the new pools to free up servers
    • reinstall the old Thumpers through JumpStart (standard config + Puppet), so that we have the same Solaris version and a raidz2 configuration everywhere
    • migrate dCache to Chimera
  • Home directories
    • implement daily snapshots of shome on t3fs06 (cron based script, delete older snapshots) DONE
    • implement incremental snapshot transfers to a backup server
  • Services
    • convert the VM t3ui02 to a real physical machine (let's take t3wn01) DONE
    • set up a new VM for the VO-Box (mostly PhEDEx; I think that Frontier should stay on a physical host with a local HD)
  • Lesser priority
    • LDAP directory service
      • should we move that onto a VM? Is the admin host indeed a good place for this? What about backup and failback? This is a critical system
      • make use of the new extension fields, so that they can be used in automated scripts
    • Attach system to NAGIOS (maybe a project for a practicum student?)
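
The two home directory tasks above (daily snapshots with cleanup, and incremental transfers to a backup server) could look roughly like the following. This is a sketch under assumptions, not the script actually deployed on t3fs06: the function names, the "daily-YYYYMMDD" naming scheme, the dataset name and the retention count are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch; dataset, prefix and retention are placeholders.

# old_snapshots KEEP: read snapshot names on stdin (one per line) and print
# all but the KEEP newest. Relies on date-stamped names such as
# <dataset>@daily-YYYYMMDD, which sort chronologically as plain text.
old_snapshots() {
    keep=$1
    sort -r | tail -n +"$((keep + 1))"
}

# rotate_snapshots DATASET KEEP: take today's snapshot, then destroy the
# older ones. Intended to be called once a day from cron.
rotate_snapshots() {
    dataset=$1
    keep=$2
    zfs snapshot "$dataset@daily-$(date +%Y%m%d)" || return 1
    zfs list -H -t snapshot -o name |
        grep "^$dataset@daily-" |
        old_snapshots "$keep" |
        while read -r snap; do
            zfs destroy "$snap"
        done
}

# send_incremental DATASET OLD NEW HOST TARGET: transfer the changes between
# two existing snapshots to a backup server over ssh.
send_incremental() {
    zfs send -i "$1@$2" "$1@$3" | ssh "$4" zfs receive "$5"
}
```

Keeping the selection of snapshots to delete in a separate function that only reads names from stdin makes it possible to check the retention logic without touching a real pool.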

UpgradePlanningForm
Title phase B restructuring
Summary dCache upgrade to 1.9.5-16. Test evacuation of old X4500 pools. Research slow t3fs05 transfer speeds.
Target Date 11-12. 03. 2010
Topic revision: r4 - 2010-03-15 - DerekFeichtinger