Scheduled Maintenance on 2011-11-07, at 8:00am
Next Monday, 7 November, we will migrate from Lustre to GPFS. This requires a full shutdown of the compute cluster and is expected to take one full day. We have reserved a second day in case something goes wrong, but, as usual, we will end the downtime as soon as everything works.
Central storage (dCache) will not be affected.
As usual, the CMS and ATLAS queues will close 24 hours before the maintenance, and the LHCb queue will close 48 hours before it.
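The queue closing times follow directly from the maintenance start. A minimal sketch of that arithmetic is shown here; the function and constant names are illustrative only, and the times are assumed to be local time with no time zone handling.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, as stated in the announcement.
MAINTENANCE_START = datetime(2011, 11, 7, 8, 0)

# Hours before the maintenance at which each queue closes.
QUEUE_CLOSE_LEAD_HOURS = {"CMS": 24, "ATLAS": 24, "LHCb": 48}


def queue_close_times(start, lead_hours):
    """Return a dict mapping each queue name to its closing time."""
    return {
        queue: start - timedelta(hours=hours)
        for queue, hours in lead_hours.items()
    }


if __name__ == "__main__":
    closes = queue_close_times(MAINTENANCE_START, QUEUE_CLOSE_LEAD_HOURS)
    for queue, closes_at in closes.items():
        print(f"{queue}: closes {closes_at:%Y-%m-%d %H:%M}")
```

Under these assumptions the CMS and ATLAS queues close on 2011-11-06 at 08:00 and the LHCb queue on 2011-11-05 at 08:00.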
Summary of interventions
We will perform the following operations on the cluster:
Back up Lustre skeleton
- Description: Back up the existing scratch directories into a tar file so that they can be restored quickly later.
- Affected nodes: wn[101-206], arc[01-02], cream[01-02]
Steps:
- Make sure no jobs are running
- Stop grid-service on all WNs and CEs
- Clean up all data/job directories
- Create the tar file and store it in a safe location
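The final step above could be sketched as follows. This is only an illustration, not the procedure actually used: it assumes that "skeleton" means the directory tree without any file contents, and the function name `backup_skeleton` and its arguments are hypothetical.

```python
import os
import tarfile


def backup_skeleton(root, archive_path):
    """Archive the directory tree under `root`, skipping regular files.

    Assumption: only the directory entries (with their mode and owner
    metadata) are needed for a later fast recovery.
    """
    with tarfile.open(archive_path, "w:gz") as tar:
        for dirpath, dirnames, _filenames in os.walk(root):
            dirnames.sort()  # deterministic traversal order
            arcname = os.path.relpath(dirpath, root)
            # recursive=False stores the directory entry only.
            tar.add(dirpath, arcname=arcname, recursive=False)
    return archive_path
```

The archive should be written outside the tree being backed up, for example to a local disk on the node running the backup.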
Prepare GPFS Servers
- Description: Clean up and prepare the hardware on all GPFS nodes
- Affected nodes: mds[1-2], oss[11-42], puppet
Steps:
- Remove all Virident cards from puppet and oss12
- Install Virident cards and remove MDT controllers from mds[1-2]
- Reinstall mds[1-2] with SL6.1
- Upgrade Virident cards/software
- Upgrade LSI controllers/cards on oss[21-42]
- Deactivate half of the CPUs on all GPFS service nodes
- Reinstall oss[11-42] with SL6.1
- Install GPFS RPMs, version 3.4.0
- Upgrade GPFS RPMs to 3.4.0-8
- Compile the GPL compatibility layer and install the resulting RPMs
- Run jbod-naming-scheme.sh to create udev rules
- Reboot the servers to make sure the device naming scheme is applied
- Install the client RPMs first
- Create the GPFS cluster
- Create the GPFS filesystem
- Install cron jobs that monitor for broken disks
Prepare GPFS Clients
- Description: Install GPFS kernel modules on all clients
- Affected nodes: wn[101-206], arc[01-02], cream[01-02]
- Notes: This may require kernel changes and subsequent reboots
Steps:
- Install OFED 1.5.3
- Install GPFS 3.4.0-0 RPMs, then upgrade to 3.4.0-8
- Compile and install the GPFS GPL compatibility layer
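The affected-node lists in this plan use a bracketed range notation such as wn[101-206]. When scripting the client steps, such a list has to be expanded into individual host names first. A minimal sketch of that expansion follows; the helper name `expand_nodes` is hypothetical, and it assumes that zero padding in the range (as in arc[01-02]) is part of the host name.

```python
import re

# Matches entries such as "wn[101-206]" or "arc[01-02]".
_RANGE = re.compile(r"^(?P<prefix>[A-Za-z_-]*)\[(?P<lo>\d+)-(?P<hi>\d+)\]$")


def expand_nodes(spec):
    """Expand a comma-separated node list into individual host names."""
    hosts = []
    for part in (p.strip() for p in spec.split(",")):
        if not part:
            continue
        match = _RANGE.match(part)
        if match is None:
            # Plain host name without a range, e.g. "puppet".
            hosts.append(part)
            continue
        lo, hi = match.group("lo"), match.group("hi")
        width = len(lo)  # keep zero padding: "01" -> "01", "02"
        for number in range(int(lo), int(hi) + 1):
            hosts.append(f"{match.group('prefix')}{number:0{width}d}")
    return hosts


if __name__ == "__main__":
    for host in expand_nodes("wn[101-206], arc[01-02], cream[01-02]"):
        print(host)
```

For the client list above this yields 110 host names: 106 worker nodes, two ARC nodes and two CREAM nodes.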
Apply fixes to CREAM
Update ARGUS
Configure ILOM on both NFS servers
- Description:
- Affected nodes: nfs[01-02]
- Notes: nfs01 is done; nfs02 is missing an ILOM card, which needs to be purchased.