Scheduled Maintenance on 2012-04-18
CSCS is moving to a new compute center in Lugano, and we will go into Scheduled Downtime for three weeks. This is the planned schedule (local Swiss time):
- 16th of April, 8:00 - Closure of LHCb queue
- 17th of April, 8:00 - Closure of ATLAS and CMS queues (jobs already submitted are still allowed to run after that)
- 18th of April, 8:00 - Shut down all machines, and prepare for transport
- 19th of April all day - Transport all racks
- 9th of May - Opening of all queues
Summary of interventions
We will perform the following operations on the cluster:
Ensure new cabling will be right
- Description: Draw schemas, take pictures, and print labels for the places where the cabling is complicated
- Notes:
- Print labels for SAS cables behind the OSSs
- Print labels for loopback switches between OSSs (not used!!)
- Draw schema behind IBM enclosures (right: top-down, left: bottom-up. Output on the right-top, input on the left, middle empty)
- Check hosts with multiple ethernet ports used. (storage01/02, mds1/2, oss11/42, none used)
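The last check above can be partly scripted. A hedged sketch, assuming the hosts are reachable over ssh under the names used in this plan: list which non-loopback interfaces are UP on each host, so the cabling pictures can record exactly which ports are in use.

```shell
#!/bin/sh
# Sketch, not a verified procedure: report which ethernet ports are up.

up_ifaces() {
  # Reads `ip -o link` output on stdin; prints names of interfaces
  # whose flags include UP, skipping the loopback device.
  awk -F': ' '$3 ~ /[<,]UP[,>]/ && $2 != "lo" {print $2}'
}

# Usage (run from an admin node; hostnames taken from the list above):
#   for h in storage01 storage02 mds1 mds2 oss11 oss42; do
#     echo "== $h =="; ssh "$h" ip -o link | up_ifaces
#   done
```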
Backup important data
- Description:
- Virtual machine images should be backed up.
- Check that there is also a proper backup of the rest of the components
- Also copy the necessary data (skeleton) from Scratch (there is a wiki page with instructions)
Shut down and uncable racks
- Description: After all backups are done, we shut down the machines and remove all cables that go from one rack to another. Everything that stays inside the rack can be left there.
- Notes: We have one day, so if there is time left, we could start removing the old network/power cables that will not be used in the new building.
- Remove Force10 switches from SunBlades
- Print labels for all equipment
- Check boxes with material
Configure the production network
- Description: Install and configure both the Ethernet switches and the Infiniband.
- Notes:
- Infiniband: There will be 2 root switches (those with the Ethernet bridge: 4036E) at the top of Rack 9, and 5 leaf switches (the older one goes to Rack 4). Three uplinks from every leaf switch to each root switch (covering ports 31-36 on each leaf)
- Ethernet: We will configure a stack of 8 Force10 switches, with ports 1-24 on each switch in VLAN_64 and ports 25-48 in VLAN_ILOM. There will be one 10G uplink using the XFP port of the switch closest to the Cisco fabric extender. If possible, the stack should be connected in a ring topology.
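The IB uplink plan above (three uplinks per root switch, ports 31-36 on each leaf) can be turned into an expected port map to diff against the fabric discovery output after cabling. Leaf and root names below are placeholders.

```shell
#!/bin/sh
# Sketch: print the expected leaf-to-root IB uplink map so it can be
# compared against iblinkinfo/ibnetdiscover output once cabled.

print_uplink_map() {
  for leaf in leaf1 leaf2 leaf3 leaf4 leaf5; do
    for p in 31 32 33; do echo "$leaf port $p -> root1"; done
    for p in 34 35 36; do echo "$leaf port $p -> root2"; done
  done
}
```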
Relocate / Start Service Nodes
- Description: Move Service Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
- Affected nodes:
- puppet (+MDT)
- blackbox
- nfs01
- nfs02
- kvm01 + 10GbE card + eth1 status + guests (change ethernet cards to virtio)
- xen11 + guests
- xen13 + guests
- xen15 + guests
- cream01
- cream02
- Notes: After preparing the VM physical hosts, we need to start and prepare the virtual machines.
- Replace certificates for argus01-02
- Start the BDIIs, ui64 and pub, at least.
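For the kvm01 note about switching guest NICs to virtio, a quick survey of which model each guest currently uses could look like the sketch below; the actual change is then done per guest with `virsh edit`. (The `model type=` pattern also matches a guest's video model, so treat the output as a rough survey.)

```shell
#!/bin/sh
# Sketch: extract NIC model types from libvirt domain XML, to find
# guests that still need switching to virtio.

nic_models() {
  # Reads libvirt domain XML on stdin; prints the model types found.
  grep -o "model type='[^']*'" | cut -d"'" -f2
}

# Usage, e.g.:
#   for g in $(virsh list --all --name); do
#     printf '%s: ' "$g"; virsh dumpxml "$g" | nic_models
#   done
```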
Relocate / Start Storage Nodes
- Description: Move Storage Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
- Affected nodes:
- storage01
- storage02
- ibm[01-04] controller + enclosures
- se[30-39]
- se[01-04] - Renamed from se[40-43]
- se[05-08]
- Notes: Afterwards we need to do the following:
- The IBM IO Servers need an ILOM firmware upgrade.
- dCache is supposed to work with the 10.10 up. Try and check it.
- Check the service is up, published in the BDII, and being used
- Notify the VO-Reps that the service is back
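Checking that the SE is published in the BDII again can be done with an LDAP query. A hedged sketch; the BDII host, port, and base DN below are assumptions to adjust to the site setup.

```shell
#!/bin/sh
# Sketch: succeed only if the SE hostname appears in BDII output.

se_published() {
  # Reads ldapsearch output on stdin; succeeds if $1 appears in it.
  grep -q "$1"
}

# Usage, e.g. (host/DN are placeholders):
#   ldapsearch -x -H ldap://bdii:2170 -b o=grid '(objectClass=GlueSE)' \
#       | se_published storage01.lcg.cscs.ch
```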
Relocate / Start Compute Nodes
- Description: Move Compute Nodes from the old racks to the new locations, rename and reinstall them.
- Affected nodes: wn[197-206] renamed to wn[01-10]
- Notes: Make sure there is 64 GB of swap on all nodes
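The swap requirement can be verified after reinstallation with a small check against /proc/meminfo, for example:

```shell
#!/bin/sh
# Sketch: verify a node has at least 64 GB of swap configured.

swap_ok() {
  # Reads /proc/meminfo on stdin; succeeds if SwapTotal >= 64 GB
  # (SwapTotal is reported in kB).
  awk '/^SwapTotal:/ { found = 1; ok = ($2 >= 64 * 1024 * 1024) }
       END { exit !(found && ok) }'
}

# Usage, e.g. (node names follow the renaming above):
#   for n in wn01 wn02 wn03; do
#     ssh "$n" cat /proc/meminfo | swap_ok || echo "$n: swap too small"
#   done
```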
Relocate / Start Scratch Nodes
- Description: Move Scratch Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
- Affected nodes:
- mds1
- mds2
- ost[11-44]
- oss[11-42]
- Notes: Once everything is in place, we need to rebuild GPFS:
- Ensure there are no leftover *.ib.lcg.cscs entries in /etc/hosts
- Re-create the cluster using only IB connections, and 4 + 2 failure groups (instead of 8 + 2)
- Copy the skeleton
- Check performance
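The /etc/hosts cleanup in the list above can be checked with a simple filter, for example:

```shell
#!/bin/sh
# Sketch: print any leftover *.ib.lcg.cscs entries in a hosts file;
# per the rebuild notes, there should be none.

stale_ib_hosts() {
  # Reads a hosts file on stdin; prints lines mentioning ib.lcg.cscs.
  grep 'ib\.lcg\.cscs' || true
}

# Usage, e.g.:  ssh mds1 cat /etc/hosts | stale_ib_hosts
```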
Service Challenge
- Description: After all machines are up and configured, we need to fill the cluster with jobs that copy to/from the SE and Scratch, to check that everything is fine.
- Notes: If there is time, possibly run Linpack on the nodes to get a rough estimate of overall cluster performance.
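Before the full challenge, a quick dd write/read probe against Scratch or the SE mount can catch gross problems early. A sketch; the target path and size are parameters, and this is not a substitute for the real transfer tests.

```shell
#!/bin/sh
# Sketch: quick write/read throughput probe against a filesystem.

probe() {
  # $1 = target directory, $2 = size in MB.
  f=$1/probe.$$
  dd if=/dev/zero of="$f" bs=1M count="$2" conv=fsync 2>&1 | tail -n1
  dd if="$f" of=/dev/null bs=1M 2>&1 | tail -n1
  rm -f "$f"
}

# Usage, e.g.:  probe /scratch 1024
```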
More things to check:
- All IB links have the right negotiated speed.
- Everything keeps working when the Force10 stack is down.
- All nagios checks are OK
- Ganglia shows all graphs
- CMS pilots pool-accounts creation (if possible)
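The IB link-speed item above can be partly automated by filtering the fabric status output for links that are up but not at full width/speed. The line format below is an assumption about iblinkinfo-style output (4X width, 10.0 Gbps per-lane QDR); adjust the pattern to what the tool actually prints.

```shell
#!/bin/sh
# Sketch: flag IB links that are LinkUp but not at 4X / 10.0 Gbps.

slow_ib_links() {
  grep 'LinkUp' | grep -v '4X.*10\.0 Gbps' || true
}

# Usage, e.g.:  iblinkinfo | slow_ib_links
```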
After this, open the queues and send the announcement