<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->
---+!! Scheduled Maintenance on 2011-11-07, at 8:00am

Next Monday, 7th of November, we will migrate from Lustre to GPFS. This requires a full compute shutdown and will take one full day. We have reserved a second day in case something goes wrong, but, as usual, we will end the downtime as soon as everything works. Central storage (dCache) will not be affected.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will close 48 hours before the maintenance.

---++!! Summary of interventions

We will perform the following operations on the cluster:

%TOC%

---

---++ %ICON{done}% Backup Lustre skeleton

   * *Description*: Back up the previous scratch directories into a tar file, to allow fast recoveries later.
   * *Affected nodes*: wn[101-206], arc[01-02], cream[01-02]
   * *Steps*:
      * Make sure no job is running
      * Stop grid services on all WNs and CEs
      * Clean up all data/job directories
      * Create the tar archive and keep it safe

---++ %ICON{done}% Prepare GPFS servers

   * *Description*: Do a fresh cleanup and hardware preparation on all GPFS nodes
   * *Affected nodes*: mds[1-2], oss[11-42], puppet
   * *Steps*:
      * Remove all Virident cards from puppet and oss12
      * Install Virident cards and remove MDT controllers on mds[1-2]
      * Reinstall mds[1-2] with SL6.1
      * Upgrade Virident cards/software
      * Upgrade LSI controllers/cards on oss[21-42]
      * Deactivate half of the CPUs on all GPFS service nodes
      * Reinstall oss[11-42] with SL6.1
      * Install the GPFS 3.4.0 RPMs
      * Upgrade to the GPFS 3.4.0-8 RPMs
      * Compile the GPL compatibility layer and install the resulting RPMs
      * Run jbod-naming-scheme.sh to create the udev rules
      * Reboot the servers to ensure the naming scheme is applied
      * Install the client RPMs first, then create the GPFS cluster
      * Create the GPFS filesystem
      * Put monitoring cron jobs for broken disks in place

---++ %ICON{done}% Prepare GPFS clients

   * *Description*: Install GPFS kernel modules on all clients
   * *Affected nodes*: wn[101-206], arc[01-02], cream[01-02]
   * *Notes*: This may require kernel changes and consequent reboots
   * *Steps*:
      * Install OFED 1.5.3
      * Install the GPFS 3.4.0-0 RPMs, then upgrade to 3.4.0-8
      * Compile and install the GPFS GPL compatibility layer

---++ %ICON{done}% Apply fixes to CREAM

   * *Description*: Apply the following updates and fixes to the CREAM CEs <verbatim>
- Increase the number of pool accounts for the LHCb VO.
- Apply tomcat5 memory tweaks.
- Move the DNS entry for cream02 to an IP in the InfiniBand network.
- Update UMD packages on the CREAM machines.</verbatim>
   * *Affected nodes*: cream[01-02], wn[101-206]
   * *Notes*:

---++ %ICON{done}% Update ARGUS

   * *Description*: Apply the following software updates to the ARGUS servers and *clean policies* <verbatim>
argus-pap
argus-pdp
argus-pep-server
emi-argus
emi-version
yaim-argus_server
argus-pdp-pep-common
argus-pep-common
emi-trustmanager
emi-trustmanager-axis</verbatim>
   * *Affected nodes*: argus[01-02]
   * *Notes*:

---++ %ICON{todo}% Configure ILOM on both NFS servers

   * *Description*:
   * *Affected nodes*: nfs[01-02]
   * *Notes*: nfs01 is done; nfs02 is missing an ILOM card, which needs to be purchased.
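The Lustre skeleton backup described above amounts to archiving the scratch tree with tar once the queues are drained. A minimal sketch follows; the actual scratch path and archive destination are not stated on this page, so a throwaway directory stands in for them here:

```shell
#!/bin/sh
# Sketch of the "Backup Lustre skeleton" step. The real scratch path and
# archive destination are assumptions; a throwaway directory is used
# in their place for illustration.
set -e

# Stand-in for the Lustre scratch area (the real path is site-specific).
WORK=$(mktemp -d)
SCRATCH="$WORK/scratch"
mkdir -p "$SCRATCH/user1"
echo demo > "$SCRATCH/user1/job.out"

# Archive the skeleton, preserving permissions, to allow a fast recovery
# of the directory layout if the migration has to be rolled back.
BACKUP="$WORK/lustre-skeleton-$(date +%Y%m%d).tar.gz"
tar -czpf "$BACKUP" -C "$WORK" scratch

# Verify the archive is readable before cleaning up the directories.
tar -tzf "$BACKUP" > /dev/null && echo "backup OK: $BACKUP"
```

On the real cluster this would run only after no jobs remain and the grid services on all WNs and CEs have been stopped, as the steps above require, and before the data/job directories are cleaned up.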
Topic revision: r11 - 2011-11-09 - JasonTemple