Scheduled Maintenance on 2012-04-18

CSCS is moving to a new compute center in Lugano, and we will go into Scheduled Downtime for three weeks. This is the planned schedule (local Swiss time):

16th of April, 8:00 - Closure of LHCb queue
17th of April, 8:00 - Closure of ATLAS and CMS queues (All already-submitted jobs are allowed to run after that)
18th of April, 8:00 - Shut down all machines, and prepare for transport
19th of April all day - Transport all racks
9th of May - Opening of all queues
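The "three weeks" figure can be checked against the dates above. A minimal sketch, assuming the year 2012 from the page title; the dictionary keys are illustrative names, not taken from the plan, and the reopening time is not given, so only calendar days are compared:

```python
from datetime import date

# Dates from the schedule above; the year 2012 is taken from the page title.
# The key names are hypothetical labels chosen for this sketch.
SCHEDULE = {
    "lhcb_queue_closed": date(2012, 4, 16),
    "atlas_cms_queues_closed": date(2012, 4, 17),
    "machines_shut_down": date(2012, 4, 18),
    "racks_transported": date(2012, 4, 19),
    "queues_reopened": date(2012, 5, 9),
}


def days_between(start_key, end_key):
    """Return the number of calendar days between two schedule entries."""
    return (SCHEDULE[end_key] - SCHEDULE[start_key]).days


# Shutdown to reopening: 21 days, i.e. exactly three weeks.
print(days_between("machines_shut_down", "queues_reopened"))
# First queue closure to reopening: 23 days.
print(days_between("lhcb_queue_closed", "queues_reopened"))
```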

Summary of interventions

We will perform the following operations on the cluster:

Ensure new cabling will be right
Backup important data
Shut down and uncable racks
Configure the production network
Relocate / Start Service Nodes
Relocate / Start Storage Nodes
Relocate / Start Compute Nodes
Relocate / Start Scratch Nodes
Service Challenge

Ensure new cabling will be right

Description: Draw schematics, take pictures, and print labels for the places where the cabling is complicated.
Notes:
- Print labels for SAS cables behind OSS's
- Print labels for loopback switches between OSSs (not used!!)
- Draw schema behind IBM enclosures (right: top-down, left: bottom-up. Output on the right-top, input on the left, middle empty)
- Check which hosts use more than one Ethernet port (checked storage01/02, mds1/2, oss11/42: none in use)

Backup important data

Description:
- Virtual machine images should be backed up.
- Check there is also proper backup of the rest of the components
- Also copy necessary stuff (skeleton) from Scratch (wiki page with instructions)

Shut down and uncable racks

Description: After all backups are done, we shut down the machines and remove all cables that go from one rack to another. Everything that stays inside the rack can be left there.
Notes: We have one day, so if time allows we could start removing old network and power cables that will not be used in the new building.
- Remove Force10 switches from SunBlades
- Print labels for all equipment
- Check boxes with material

Configure the production network

Description: Install and configure both the Ethernet switches and the Infiniband.
Notes:
- InfiniBand: There will be 2 root switches (the ones with the Ethernet bridge: 4036E) at the top of Rack 9, and 5 leaf switches (the older one goes to Rack 4). Every leaf switch has three uplinks to each root switch, using ports 31-36 on each leaf.
- Ethernet: We will configure a stack of 8 Force10 switches, each with ports 1-24 in VLAN_64 and ports 25-48 in VLAN_ILOM. There will be one 10G uplink through the XFP port on the back of the switch closest to the Cisco fabric extender. If possible, the stack should be connected in a ring topology.
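The layout in these notes can be written down as a small cabling and port plan, which is handy for printing labels. This is only a sketch: the switch names (`root1`, `leaf1`, ...) are hypothetical, and the split of leaf ports 31-33 to the first root and 34-36 to the second is an assumption, since the notes only say that ports 31-36 carry the uplinks.

```python
# Hypothetical switch names; the plan only gives the counts (2 roots, 5 leaves).
ROOTS = ["root1", "root2"]
LEAVES = [f"leaf{i}" for i in range(1, 6)]
UPLINK_PORTS = list(range(31, 37))  # leaf ports 31-36, as stated in the notes
UPLINKS_PER_ROOT = 3                # three uplinks from every leaf to each root


def ib_uplinks():
    """Return (leaf, leaf_port, root) tuples for every InfiniBand uplink.

    Assumption: the first three uplink ports of a leaf go to the first root
    and the last three to the second root.
    """
    links = []
    for leaf in LEAVES:
        ports = iter(UPLINK_PORTS)
        for root in ROOTS:
            for _ in range(UPLINKS_PER_ROOT):
                links.append((leaf, next(ports), root))
    return links


def vlan_for_port(port):
    """Return the VLAN of an access port on any switch of the Force10 stack."""
    if 1 <= port <= 24:
        return "VLAN_64"
    if 25 <= port <= 48:
        return "VLAN_ILOM"
    raise ValueError(f"port {port} is outside the 1-48 range")


if __name__ == "__main__":
    for leaf, port, root in ib_uplinks():
        print(f"{leaf} port {port} -> {root}")
```

With these numbers there are 30 uplink cables in total, 15 arriving at each root switch.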

Relocate / Start Service Nodes

Description: Move Service Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
Affected nodes:
- puppet (+MDT)
- blackbox
- nfs01
- nfs02
- kvm01 + 10GbE card + eth1 status + guests (change ethernet cards to virtio)
- xen11 + guests
- xen13 + guests
- xen15 + guests
- cream01
- cream02
Notes: After preparing the VM Physical hosts, we need to start and prepare the virtual hosts.
- Replace certificates for argus01-02
- Start the BDIIs, ui64 and pub, at least.

Relocate / Start Storage Nodes

Description: Move Storage Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
Affected nodes:
- storage01
- storage02
- ibm[01-04] controller + enclosures
- se[30-39]
- se[01-04] - Renamed from se[40-43]
- se[05-08]
Notes: Afterwards we need to do the following:
- The IBM IO Servers need an ILOM firmware upgrade.
- dCache is supposed to work with the 10.10 up. Try and check it.
- Check the service is up, published in the BDII, and being used
- Announce to the VO-Reps that the service is back
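The bracket notation used in the node lists (for example se[30-39]) and the renames can be expanded mechanically, for instance to produce labels or a rename table. A minimal sketch; the helper names `expand` and `rename_map` are made up for this example, and pairing old and new names in ascending order is an assumption, not something the plan states:

```python
import re

# Matches names such as "se[30-39]": a prefix and one numeric range.
_RANGE = re.compile(r"^(?P<prefix>[A-Za-z]+)\[(?P<start>\d+)-(?P<end>\d+)\]$")


def expand(pattern):
    """Expand 'se[30-39]' into ['se30', ..., 'se39'], keeping zero padding."""
    match = _RANGE.match(pattern)
    if match is None:
        return [pattern]  # a plain hostname such as 'storage01'
    width = len(match.group("start"))
    start, end = int(match.group("start")), int(match.group("end"))
    return [f"{match.group('prefix')}{n:0{width}d}" for n in range(start, end + 1)]


def rename_map(old_pattern, new_pattern):
    """Pair old and new hostnames in ascending order (assumed ordering)."""
    old, new = expand(old_pattern), expand(new_pattern)
    if len(old) != len(new):
        raise ValueError("old and new ranges differ in length")
    return dict(zip(old, new))


if __name__ == "__main__":
    print(expand("se[30-39]"))
    print(rename_map("se[40-43]", "se[01-04]"))
```

The same helper covers the compute node rename, e.g. `rename_map("wn[197-206]", "wn[01-10]")`.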

Relocate / Start Compute Nodes

Description: Move Compute Nodes from the old racks to the new locations, rename and reinstall them.
Affected nodes: wn[197-206] renamed to wn[01-10]
Notes: Make sure there is 64 GB Swap on all nodes
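The swap requirement can be verified on each node from the contents of `/proc/meminfo`. A sketch under two assumptions: "64 GB" is read as 64 GiB, and a small slack is allowed because the kernel reports slightly less than the raw partition size. The function names and the slack value are illustrative.

```python
import re


def swap_total_kib(meminfo_text):
    """Extract SwapTotal (in KiB, shown as 'kB') from /proc/meminfo contents."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no SwapTotal line found")
    return int(match.group(1))


def has_enough_swap(meminfo_text, required_gib=64, slack_kib=1024):
    """True if swap is at least required_gib GiB, within slack_kib KiB.

    Assumption: the 64 GB in the plan means 64 GiB; slack_kib is a
    hypothetical tolerance for the swap header overhead.
    """
    return swap_total_kib(meminfo_text) + slack_kib >= required_gib * 1024 * 1024
```

On a node this could be called as `has_enough_swap(open("/proc/meminfo").read())`.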

Relocate / Start Scratch Nodes

Description: Move Scratch Nodes from the old racks to the new locations, start them, remove the VNIC, and check status.
Affected nodes:
- mds1
- mds2
- ost[11-44]
- oss[11-42]
Notes: Once the nodes are ready, we need to rebuild GPFS:
- Ensure no leftover *.ib.lcg.cscs entries remain in /etc/hosts
- Re-create the cluster using only IB connections, and 4 + 2 failure groups (instead of 8 + 2)
- Copy the skeleton
- Check performance
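The /etc/hosts check from the list above can be scripted so it is easy to repeat on every node. A minimal sketch that takes the file contents as text; the function name is made up, and the pattern is copied literally from the note, so it may need adjusting if the real names carry a further domain suffix:

```python
from fnmatch import fnmatchcase


def leftover_ib_entries(hosts_text, pattern="*.ib.lcg.cscs"):
    """Return (line_number, address, name) for each hostname matching pattern.

    Comments (after '#') and lines without a hostname are ignored.
    """
    hits = []
    for lineno, line in enumerate(hosts_text.splitlines(), start=1):
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue
        address, names = fields[0], fields[1:]
        for name in names:
            if fnmatchcase(name, pattern):
                hits.append((lineno, address, name))
    return hits
```

On a node, `leftover_ib_entries(open("/etc/hosts").read())` should return an empty list before the cluster is re-created.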

Service Challenge

Description: After all machines are up and configured, we need to fill up the cluster with jobs, copying to/from the SE / Scratch, to check everything is fine.
Notes: Possibly, if there is time, do a Linpack on the nodes, to have an estimate of the cluster's performance in general terms.

More things to check:

All IB links have the right negotiated speed.
Everything works fine with the Force10 stack down.
All nagios checks are OK
Ganglia shows all graphs
CMS pilots pool-accounts creation (if possible)
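The first check in this list (negotiated IB link speed) could be automated by parsing per-node port status. The sketch below assumes text shaped like the output of `ibstat`, with `CA '<name>'`, `Port <n>:` and `Rate: <n>` lines; the expected rate is deliberately a parameter, since the plan does not state it, and both function names are illustrative.

```python
import re


def port_rates(ibstat_text):
    """Map (ca_name, port_number) to the reported rate.

    Assumes ibstat-like text with "CA '<name>'", "Port <n>:" and "Rate: <n>" lines.
    """
    rates = {}
    ca = port = None
    for raw in ibstat_text.splitlines():
        line = raw.strip()
        match = re.match(r"CA '(.+)'$", line)
        if match:
            ca, port = match.group(1), None
            continue
        match = re.match(r"Port (\d+):$", line)
        if match:
            port = int(match.group(1))
            continue
        match = re.match(r"Rate:\s*(\d+)", line)
        if match and ca is not None and port is not None:
            rates[(ca, port)] = int(match.group(1))
    return rates


def slow_ports(ibstat_text, expected_rate):
    """Return the ports whose reported rate is below expected_rate."""
    return {
        key: rate
        for key, rate in port_rates(ibstat_text).items()
        if rate < expected_rate
    }
```

An empty result from `slow_ports` on every node would mean all links negotiated at least the expected rate.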

After this, open the queues and send the announcement

Topic revision: r8 - 2012-05-07 - PabloFernandez

Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.