Scheduled Maintenance on 2012-11-14

On the second working Wednesday of this month we will go into scheduled downtime. This time the operations will take longer than usual (two days), so the downtime will last from 9:00 on 14/11 to 18:00 on 15/11. As always, we will return to operation as soon as we finish and feel confident with the changes.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before it.
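
The dates above can be cross-checked with a short script. This is only a minimal sketch: it assumes that "second working Wednesday" simply means the second Wednesday of the month (public holidays are ignored), and the function and variable names are illustrative, not part of any site tooling.

```python
from datetime import date, datetime, time, timedelta


def second_wednesday(year: int, month: int) -> date:
    """Return the second Wednesday of the month (holidays are NOT considered)."""
    first = date(year, month, 1)
    # date.weekday(): Monday == 0, so Wednesday == 2
    days_to_first_wed = (2 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_wed + 7)


# Downtime window as announced: 9:00 on 14/11 to 18:00 on 15/11
start = datetime.combine(second_wednesday(2012, 11), time(9, 0))
end = datetime(2012, 11, 15, 18, 0)

# Queue closing offsets taken from the announcement
queue_close = {
    "CMS": start - timedelta(hours=24),
    "ATLAS": start - timedelta(hours=24),
    "LHCb": start - timedelta(hours=48),
}

for name, when in queue_close.items():
    print(f"{name} queue closes at {when:%Y-%m-%d %H:%M}")
print(f"Downtime starts {start:%Y-%m-%d %H:%M}, ends {end:%Y-%m-%d %H:%M}")
```

Under these assumptions the LHCb queue closes on the Monday morning before the downtime and the other two queues on the Tuesday morning.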

_REMOVE: REMEMBER TO ADD DOWNTIME IN GOCDB and CLOSE THE QUEUES_

Summary of interventions

We will perform the following operations on the cluster:


Upgrade dCache from 1.9.5 to 1.9.12-21

  • Description: This is the most important operation of the maintenance. It includes a few extra changes that make the operation a bit complex.
  • Affected nodes: storage[01-02], se[01-08,11-12]
  • Notes: The following steps should be taken:
    • Do a full backup of the head nodes after stopping the service
    • Perform a Java upgrade
    • Perform a PostgreSQL version upgrade to 8.4 (includes a dump/restore of all databases)
    • Upgrade dCache and run migration scripts
    • Add xRootdDomain to storage01
    • Check BDII and lcg-tools
    • Check Nagios, Ganglia and FZK monitoring
  • OS upgrade (to SL6) will NOT be performed, to avoid increasing the complexity of the operation.
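
The "Affected nodes" entries on this page use a compact bracket notation such as se[01-08,11-12]. As an illustration of how that notation reads, here is a minimal sketch of an expander; the function name `expand_hostlist` is hypothetical and it only handles the simple "prefix[ranges]" form used here.

```python
import re
from typing import List


def expand_hostlist(spec: str) -> List[str]:
    """Expand e.g. 'se[01-08,11-12]' into individual host names."""
    match = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", spec)
    if match is None:
        return [spec]  # plain host name, nothing to expand
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        low, _, high = part.partition("-")
        high = high or low  # a single number such as '05'
        width = len(low)    # preserve zero padding
        hosts.extend(
            f"{prefix}{n:0{width}d}" for n in range(int(low), int(high) + 1)
        )
    return hosts


# Nodes listed for the dCache upgrade
affected = expand_hostlist("storage[01-02]") + expand_hostlist("se[01-08,11-12]")
print(len(affected), "nodes:", ", ".join(affected))
```

A list like this could be used, for example, to check that every affected node has been covered by the backup step before the upgrade starts.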

Move the Nexus 2232 under both of Phoenix's 5548 Ethernet switches

  • Description: The 10 GbE switch currently hangs directly off the CSCS network and needs to be moved under the Phoenix-specific Ethernet root switches.
  • Affected nodes: All nodes (Internet connectivity, plus all 10 GbE machines, including all virtual machines)
  • Notes: Operation will be performed jointly with our network manager.

Voltaire IB 4036 & 4036E switches firmware upgrade

  • Description: The new 3.9.1 version contains a number of bug fixes that need to be applied to all CSCS switches, including ours.
  • Affected nodes: All
  • Notes: The upgrade will first be performed on one of the swib3 4036E switches (with the other one shut down). We will then monitor it for some time to confirm that all components work well before upgrading the rest of the switches.

Recable IB

  • Description: The core switches (core9a and core9b) have hosts connected directly to them (the new ones, essentially compute nodes). These hosts need to be moved to the recently installed swib8.
  • Affected nodes: wn[48-69]
  • Notes: Also swib8 needs to be connected to both core switches.

Exchange wn47 with wn[39-46]

  • Description: The latter nodes push Rack11 beyond 10 kW of power consumption and are currently powered from Rack10 (with long power cords). It would be best to avoid inter-rack cables, so they will be moved.
  • Affected nodes: wn[39-47]
  • Notes: wn47 takes the same space, but it's in Rack9. Also, we need to re-label the necessary cables.
Topic revision: r1 - 2012-10-31 - PabloFernandez
 