
Scheduled Maintenance on 2012-11-14

On the second working Wednesday of the month we will go into Scheduled Downtime. This time the operations will take longer (two days), so the downtime will last from 9:00 (14/11) until 18:00 (15/11), but as always we will return to operation as soon as we finish and are confident with the changes.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue 48 hours before.
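Closing the queues ahead of time can be scripted. The following is a minimal sketch that assumes a Torque/Maui batch system and queue names `cms`, `atlas` and `lhcb` — none of which is confirmed by this announcement. With DRY_RUN=1 (the default) it only prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical sketch of draining the experiment queues before the downtime.
# Batch system (Torque/Maui) and queue names are assumptions.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "+ $*"        # show what would be executed
    else
        "$@"
    fi
}

# Closing a queue: reject new submissions, but let queued/running jobs finish.
close_queue() {
    run qmgr -c "set queue $1 enabled = False"
}

close_queue lhcb    # 48 hours before the maintenance
close_queue cms     # 24 hours before
close_queue atlas   # 24 hours before
```

Disabling submission while leaving the queue started lets jobs already in the system drain naturally before the downtime window.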

Summary of interventions

We will perform the following operations on the cluster:

Upgrade dCache from 1.9.5 to 1.9.12-21

  • Description: This is the most important operation of the maintenance. It includes a few extra changes that make the operation a bit complex.
  • Affected nodes: storage[01-02], se[01-08,11-12]
  • Notes: The following steps should be taken:
    • Do a full backup of the head nodes after stopping the service
    • Perform a Java upgrade
    • Perform a PostgreSQL version upgrade to 8.4 (includes a dump/restore of all databases)
    • Upgrade dCache and run migration scripts
    • Add xRootdDomain to storage01
    • Check BDII and lcg-tools
    • Check Nagios, Ganglia and FZK monitoring
  • OS upgrade (to SL6) will NOT be performed, to avoid increasing the complexity of the operation.
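The dump/restore step of the PostgreSQL upgrade can be sketched in shell. This is a hypothetical outline, not the actual procedure: the backup path and service names are assumptions, and with DRY_RUN=1 (the default) it only prints what would run.

```shell
#!/bin/sh
# Hypothetical sketch of the dump/restore part of the PostgreSQL 8.4 upgrade.
# Backup path and service names are assumptions. DRY_RUN=1 (default) only
# prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
BACKUP=/var/backups/dcache-pg-upgrade.sql    # assumed backup location

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "+ $*"                           # show what would be executed
    else
        "$@"
    fi
}

run dcache stop                               # stop dCache before touching the DBs
run sh -c "pg_dumpall -U postgres > $BACKUP"  # dump ALL databases with the old server
run service postgresql stop
# ... install the PostgreSQL 8.4 packages and initialise a fresh cluster here ...
run service postgresql start
run psql -U postgres -f "$BACKUP"             # replay the dump into the new server
```

Dumping with the old server and restoring into the new one is the standard major-version upgrade path for PostgreSQL of that era; the dCache migration scripts would then run against the restored databases.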

Move the Nexus 2232 under both of Phoenix's 5548 Ethernet switches

  • Description: The 10 GbE switch is currently connected directly under the CSCS network, and needs to be moved under the Phoenix-specific Ethernet root switches.
  • Affected nodes: All nodes (Internet connectivity), and all 10 GbE machines, including all the virtual machines
  • Notes: Operation will be performed jointly with our network manager.

Voltaire IB 4036 & 4036E switches firmware upgrade

  • Description: There are a number of bugfixes in the new 3.9.1 firmware that need to be applied to all CSCS switches, including ours.
  • Affected nodes: All
  • Notes: The upgrade will first be performed on one of the swib3 4036E switches (with the other one shut down); after verifying for some time that all components work well, it will be rolled out to the rest of the switches.

Recable some IB nodes

  • Description: The core switches (core9a and core9b) have hosts (the new ones, essentially compute nodes) connected directly to them. These hosts need to be moved to the recently installed swib8.
  • Affected nodes: wn[48-69].
  • Notes: swib8 also needs to be connected to both core switches.
    • wn[20,22,24,26] need to be re-cabled to tidy up the cabling.

Electrical redistribution of some nodes

  • Description: Some power strips are unbalanced and need some offline work (some online rebalancing was already done with the Facilities team)
  • Affected nodes: wn47, wn[39-46], oss[11-42]
  • Notes:
    • Exchange wn47 with wn[39-46]: the latter nodes push Rack11 beyond 10 kW of power consumption and are currently powered from Rack10 (with long power cords); they will be physically exchanged, and the affected cables re-labelled.
    • Redistribute the GPFS nodes (oss[11-42]) so that their consumption is spread more evenly across the three phases.

Compute nodes (double twins) firmware update

  • Description: Intel has found a few bugs in some server components (BIOS, backplanes and power supplies) and we need to update them.
  • Affected nodes: wn[11-46], wn[48-59]
  • Notes: The firmware is still to be shipped by Dalco. We will have a node running the new firmware ready a few days before the upgrade, for testing.
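To compare firmware levels on the test node before and after flashing, the versions can be recorded from the running OS. A small sketch using dmidecode (requires root; the output file name is an assumption, and nodes without dmidecode report "unknown"):

```shell
#!/bin/sh
# Hypothetical sketch: record firmware levels so they can be compared before
# and after the update, and across wn[11-46] and wn[48-59].
# The output file name is an assumption; dmidecode normally needs root.
OUT=${OUT:-/tmp/firmware-inventory.txt}

record() {
    # $1 = label, $2 = dmidecode string keyword
    v=$(dmidecode -s "$2" 2>/dev/null || echo unknown)
    echo "$1: $v" >> "$OUT"
}

: > "$OUT"                                    # start a fresh inventory
record "BIOS version"      bios-version
record "BIOS release date" bios-release-date
record "Board product"     baseboard-product-name
cat "$OUT"
```

Running this on every node and diffing the files would show which nodes still carry the old BIOS after the maintenance.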
Topic revision: r4 - 2012-11-09 - PabloFernandez