Scheduled Maintenance on 2012-04-18

CSCS is moving to a new compute center in Lugano, and we will go into Scheduled Downtime for three weeks. This is the planned schedule (local Swiss time):

  • 16th of April, 8:00 - Closure of LHCb queue
  • 17th of April, 8:00 - Closure of ATLAS and CMS queues (jobs already submitted are still allowed to run after that)
  • 18th of April, 8:00 - Shut down all machines, and prepare for transport
  • 19th of April all day - Transport all racks
  • 9th of May - Opening of all queues

Summary of interventions

We will perform the following operations on the cluster:

Ensure the new cabling is correct DONE

  • Description: Draw schematics, take pictures, and print labels for the places where the cabling is complicated
  • Notes:
    • Print labels for SAS cables behind the OSSs DONE
    • Print labels for loopback switches between OSSs DONE (not used)
    • Draw schematic of the cabling behind the IBM enclosures DONE (right: top-down, left: bottom-up. Output on the top right, input on the left, middle empty)
    • Check hosts with multiple Ethernet ports in use. DONE (storage01/02, mds1/2, oss11/42, none used)

Backup important data DONE

  • Description:
    • Virtual machine images should be backed up.
    • Check there is also proper backup of the rest of the components DONE
    • Also copy the necessary files (skeleton) from Scratch DONE (wiki page with instructions)

Shut down and uncable racks DONE

  • Description: After all backups are done, we shut down the machines and remove all cables that go from one rack to another. Everything that stays inside the rack can be left there.
  • Notes: We have one day. If there is time left over, we could start removing old network / power cables that are not going to be used in the new building.
    • Remove Force10 switches from SunBlades DONE
    • Print labels for all equipment
    • Check boxes with material DONE

Configure the production network DONE

  • Description: Install and configure both the Ethernet switches and the InfiniBand switches.
  • Notes:
    • InfiniBand: There will be 2 root switches (those with the Ethernet Bridge: 4036E) at the top of Rack 9, and 5 leaf switches (the older one goes to Rack 4). Three uplinks run from every leaf switch to each root switch (using ports 31-36 on each leaf) DONE
    • Ethernet: We will configure a stack of 8 Force10 switches. On each switch, ports 1-24 go to VLAN_64 and ports 25-48 to VLAN_ILOM. There will be one 10G uplink, through the XFP port at the back of the switch closest to the Cisco fabric extender. If possible, the stack should be connected in a ring topology. DONE
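The port layout described above can be sketched in a few lines of Python, for example to generate cabling labels or to cross-check a switch configuration. This is only an illustration: the root switch names `root1` and `root2` are hypothetical, and the assumption that ports 31-33 go to the first root and 34-36 to the second is not stated in the plan, which only says that ports 31-36 carry three uplinks to each root.

```python
def vlan_for_port(port: int) -> str:
    """Return the VLAN for a Force10 port, following the 1-24 / 25-48 split."""
    if 1 <= port <= 24:
        return "VLAN_64"
    if 25 <= port <= 48:
        return "VLAN_ILOM"
    raise ValueError(f"port {port} is outside the 1-48 range")


def leaf_uplinks(roots=("root1", "root2"), first_port=31, per_root=3):
    """Map leaf uplink ports to root switches.

    Assumption: consecutive ports are grouped per root (31-33, then 34-36).
    The root names are placeholders, not real host names.
    """
    plan = {}
    port = first_port
    for root in roots:
        for _ in range(per_root):
            plan[port] = root
            port += 1
    return plan


if __name__ == "__main__":
    for port, root in leaf_uplinks().items():
        print(f"leaf port {port} -> {root}")
    print("port 12:", vlan_for_port(12))
    print("port 30:", vlan_for_port(30))
```

With 5 leaf switches and 6 uplinks per leaf, this layout gives 30 leaf-to-root links in total.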

Relocate / Start Service Nodes

  • Description: Move Service Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
  • Affected nodes:
    • puppet (+MDT) DONE
    • blackbox
    • nfs01 DONE
    • nfs02 DONE
    • kvm01 + 10GbE card + eth1 status + guests (change ethernet cards to virtio) DONE
    • xen11 + guests DONE
    • xen13 + guests
    • xen15 + guests DONE
    • cream01 DONE
    • cream02 DONE
  • Notes: After preparing the physical VM hosts, we need to start and prepare the virtual hosts.
    • Replace certificates for argus01-02 DONE
    • Start at least the BDIIs, ui64 and pub. DONE

Relocate / Start Storage Nodes DONE

  • Description: Move Storage Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
  • Affected nodes:
    • storage01 DONE
    • storage02 DONE
    • ibm[01-04] controller + enclosures DONE
    • se[30-39] DONE
    • se[01-04] - Renamed from se[40-43] DONE
    • se[05-08] DONE
  • Notes: Afterwards we need to do the following:
    • The IBM IO Servers need an ILOM firmware upgrade. DONE
    • dCache is supposed to work with the 10.10 up. Try and check it. DONE
    • Check the service is up, published in the BDII, and being used DONE
    • Announce to the VO-Reps that the service is back

Relocate / Start Compute Nodes

  • Description: Move Compute Nodes from the old racks to the new locations, rename and reinstall them.
  • Affected nodes: wn[197-206] renamed to wn[01-10]
  • Notes: Make sure all nodes have 64 GB of swap
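As a rough sketch of the two compute-node steps above, the following Python builds the old-to-new name mapping and reads the swap size from the contents of `/proc/meminfo`. The function names are made up for this example, and it assumes the renaming is sequential (wn197 becomes wn01, and so on up to wn206 becoming wn10), which the plan implies but does not spell out.

```python
def rename_map(old_first=197, old_last=206, new_first=1):
    """Map old worker node names to new ones, assuming a sequential renaming."""
    return {
        f"wn{old}": f"wn{new_first + i:02d}"
        for i, old in enumerate(range(old_first, old_last + 1))
    }


def swap_total_gib(meminfo_text: str) -> float:
    """Return the SwapTotal value from /proc/meminfo text, converted to GiB.

    /proc/meminfo reports sizes in units of 1024 bytes, labeled 'kB'.
    """
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("no SwapTotal line found")


if __name__ == "__main__":
    for old, new in rename_map().items():
        print(f"{old} -> {new}")
```

On a node, `swap_total_gib(open("/proc/meminfo").read())` gives the value to compare against 64. A swap partition usually reports slightly less than its nominal size, so an exact comparison may be too strict.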

Relocate / Start Scratch Nodes DONE

  • Description: Move Scratch Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
  • Affected nodes:
    • mds1 DONE
    • mds2 DONE
    • ost[11-44] DONE
    • oss[11-42] DONE
  • Notes: Once the nodes are ready, we need to rebuild GPFS:
    • Ensure no leftover *.ib.lcg.cscs host entries remain in /etc/hosts DONE
    • Re-create the cluster using only IB connections, and 4 + 2 failure groups (instead of 8 + 2) DONE
    • Copy the skeleton DONE
    • Check performance DONE
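The /etc/hosts check in the notes above could be automated along these lines. This is a minimal sketch, not the procedure that was actually used: the function name is hypothetical, and it simply flags any uncommented entry whose host names contain the `.ib.lcg.cscs` string from the plan.

```python
def leftover_ib_entries(hosts_text: str, marker: str = ".ib.lcg.cscs"):
    """Return the hosts-file lines that still name a host containing `marker`."""
    hits = []
    for line in hosts_text.splitlines():
        # Drop comments, then split into the address and its host names.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        if any(marker in name for name in fields[1:]):
            hits.append(line.strip())
    return hits
```

Calling `leftover_ib_entries(open("/etc/hosts").read())` on each node should return an empty list once the cleanup is complete.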

Service Challenge

  • Description: After all machines are up and configured, we need to fill up the cluster with jobs, copying to/from the SE / Scratch, to check everything is fine.
  • Notes: If there is time, run Linpack on the nodes to get a rough estimate of the cluster's overall performance.
More things to check:
  • All IB links have the right negotiated speed.
  • Everything works fine with the Force10 stack down.
  • All Nagios checks are OK
  • Ganglia shows all graphs DONE
  • CMS pilots pool-accounts creation (if possible) DONE

After this, open the queues and send the announcement

Topic revision: r6 - 2012-05-03 - PabloFernandez