Scheduled Maintenance on 2011-11-07, at 8:00am

On Monday, 7 November, we will migrate from Lustre to GPFS. This requires a full shutdown of the compute cluster and is expected to take one full day. We have reserved a second day in case something goes wrong, but, as usual, we will end the downtime as soon as everything works.

Central storage (dCache) will not be affected.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before the maintenance.
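The closing times follow directly from the maintenance start time. A minimal sketch of the arithmetic, assuming the start time in the title is local time and using hypothetical lowercase queue names (the real batch queue names may differ):

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the title of this page.
MAINTENANCE_START = datetime(2011, 11, 7, 8, 0)

# Hours before the maintenance at which each queue closes.
# The queue names are illustrative, not the actual batch queue names.
CLOSE_HOURS_BEFORE = {"cms": 24, "atlas": 24, "lhcb": 48}


def queue_close_times(start, hours_before):
    """Return a dict mapping each queue name to its closing datetime."""
    return {name: start - timedelta(hours=h) for name, h in hours_before.items()}


if __name__ == "__main__":
    times = queue_close_times(MAINTENANCE_START, CLOSE_HOURS_BEFORE)
    for name, when in sorted(times.items()):
        print(f"{name}: closes {when:%Y-%m-%d %H:%M}")
```

With these inputs the LHCb queue closes on 2011-11-05 at 08:00 and the CMS and ATLAS queues on 2011-11-06 at 08:00.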

Summary of interventions

We will perform the following operations on the cluster:


DONE Backup Lustre skeleton

  • Description: Back up the existing scratch directories into a tar file, so that they can be restored quickly later.
  • Affected nodes: wn[101-206], arc[01-02], cream[01-02]
Steps:
  • Make sure no job is running
  • Stop grid-service on all WNs and CEs
  • Clean up all data/job directories
  • Create the tar file and store it in a safe place
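The last step can be sketched as follows. This is only an illustration, assuming that "skeleton" means the directory tree without any job data; the function name and paths are hypothetical and this is not necessarily how the backup was actually made:

```python
import os
import tarfile


def backup_skeleton(root, archive_path):
    """Archive the directory tree under `root`, skipping all regular files.

    Returns the number of directories stored in the archive.
    """
    parent = os.path.dirname(os.path.abspath(root))
    count = 0
    with tarfile.open(archive_path, "w:gz") as tar:
        for dirpath, dirnames, _filenames in os.walk(root):
            dirnames.sort()  # deterministic ordering inside the archive
            # Store paths relative to the parent, e.g. "scratch/cms/jobs".
            arcname = os.path.relpath(dirpath, parent)
            # recursive=False adds only the directory entry itself,
            # keeping its mode, owner and timestamps.
            tar.add(dirpath, arcname=arcname, recursive=False)
            count += 1
    return count
```

A call such as backup_skeleton("/path/to/scratch", "/path/to/backup/skeleton.tar.gz") would be run once the directories have been cleaned; both paths are placeholders.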

TODO Prepare GPFS Servers

  • Description: Clean up and prepare the hardware on all GPFS nodes
  • Affected nodes: mds[1-2], oss[11-42], puppet
Steps:
  • Remove all Virident cards from puppet and oss12
  • Install Virident cards and remove MDT controllers from mds[1-2]
  • Reinstall mds[1-2] with SL6.1
  • Upgrade Virident cards/software
  • Upgrade LSI controllers/cards on oss[21-42]
  • Deactivate half of the CPUs on all GPFS service nodes
  • Reinstall oss[11-42] with SL6.1
  • Install GPFS 3.4.0 rpms
  • Upgrade to GPFS rpms 3.4.0-8
  • Compile the GPL compatibility layer and install the resulting rpms
  • Run jbod-naming-scheme.sh to create udev rules
  • Reboot the servers to make sure the device naming scheme is applied
  • Install the client rpms first, then create the GPFS cluster
  • Create the GPFS filesystem
  • Set up monitoring cron jobs for broken disks

TODO Prepare GPFS Clients

  • Description: Install GPFS kernel modules on all clients
  • Affected nodes: wn[101-206], arc[01-02], cream[01-02]
  • Notes: This may require kernel changes and consequent reboots
Steps:
  • Install OFED 1.5.3
  • Install GPFS 3.4.0-0 rpms, upgrade to 3.4.0-8
  • Compile and install the GPFS GPL compatibility layer
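The "Affected nodes" entries on this page use a bracket range notation such as wn[101-206]. When the steps have to be applied node by node, the notation can be expanded into individual hostnames. A minimal sketch, assuming only the simple prefix[lo-hi] form used here (the helper name is hypothetical):

```python
import re

# Matches a single range such as "wn[101-206]" or "arc[01-02]".
_RANGE = re.compile(r"^(?P<prefix>[A-Za-z_-]+)\[(?P<lo>\d+)-(?P<hi>\d+)\]$")


def expand_nodes(spec):
    """Expand a comma-separated node spec into a list of hostnames."""
    hosts = []
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        m = _RANGE.match(part)
        if m is None:
            hosts.append(part)  # plain hostname such as "puppet"
            continue
        lo, hi = m.group("lo"), m.group("hi")
        if int(lo) > int(hi):
            raise ValueError(f"descending range in {part!r}")
        width = len(lo)  # keep zero padding: arc[01-02] -> arc01, arc02
        hosts.extend(
            f"{m.group('prefix')}{n:0{width}d}"
            for n in range(int(lo), int(hi) + 1)
        )
    return hosts


if __name__ == "__main__":
    clients = expand_nodes("wn[101-206], arc[01-02], cream[01-02]")
    print(len(clients), clients[0], clients[-1])
```

For the client list above this yields 110 hostnames, from wn101 to cream02.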

DONE Apply fixes to CREAM

  • Description: Apply the following updates and fixes to CREAM-CEs
    - Increase the number of pool accounts for the LHCb VO.
    - Apply tomcat5 memory tweaks.
    - Move the DNS entry for cream02 to an IP in the InfiniBand network.
    - Update UMD packages in CREAM machines.
  • Affected nodes: cream[01-02], wn[101-206]
  • Notes:

DONE Update ARGUS

  • Description: Apply the following software updates to the ARGUS servers and clean up the policies:
    argus-pap
    argus-pdp
    argus-pep-server
    emi-argus
    emi-version
    yaim-argus_server
    argus-pdp-pep-common
    argus-pep-common
    emi-trustmanager
    emi-trustmanager-axis
  • Affected nodes: argus[01-02]
  • Notes:

TODO Configure ILOM on both NFS servers

  • Description:
  • Affected nodes: nfs[01-02]
  • Notes: nfs01 is done; nfs02 is missing an ILOM card, which needs to be purchased.
Topic revision: r10 - 2011-11-09 - MiguelGila
 