<!-- keep this as a security measure:
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.LCGAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.LCGAdminGroup
#uncomment this if you want the page only be viewable by the internal people
#* Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.LCGAdminGroup
-->

---+!! Scheduled Maintenance on 2012-04-18

CSCS is moving to a new compute center in Lugano, and we will go into Scheduled Downtime for three weeks. This is the planned schedule (local Swiss time):

   * 16th of April, 8:00 - Closure of LHCb queue
   * 17th of April, 8:00 - Closure of ATLAS and CMS queues (all already-submitted jobs are allowed to run after that)
   * 18th of April, 8:00 - Shut down all machines, and prepare for transport
   * 19th of April, all day - Transport all racks
   * 9th of May - Opening of all queues

---++!! Summary of interventions

We will perform the following operations on the cluster:

%TOC%

---++ Ensure new cabling will be right %ICON{done}%
   * *Description*: Draw schemas, take pictures, and print labels for the places where cabling is complicated.
   * *Notes*:
      * Print labels for SAS cables behind OSS's %ICON{done}%
      * Print labels for loopback switches between OSSs %ICON{done}% (not used!!)
      * Draw schema behind IBM enclosures %ICON{done}% (right: top-down, left: bottom-up. Output on the right-top, input on the left, middle empty)
      * Check hosts with multiple ethernet ports used. %ICON{done}% (storage01/02, mds1/2, oss11/42, none used)

---++ Backup important data %ICON{done}%
   * *Description*:
      * Virtual machine images should be backed up.
      * Check there is also proper backup of the rest of the components %ICON{done}%
      * Also copy necessary stuff (skeleton) from Scratch %ICON{done}% (wiki page with instructions)

---++ Shut down and uncable racks %ICON{done}%
   * *Description*: After all backups are done, we shut down the machines and remove all cables that go from one rack to another. Everything that stays inside the rack can be left there.
   * *Notes*: We have one day, so if there is time left over, we could start removing old network / power cables that are not going to be used in the new building.
      * Remove Force10 switches from SunBlades %ICON{done}%
      * Print labels for all equipment
      * Check boxes with material %ICON{done}%

---++ Configure the production network %ICON{done}%
   * *Description*: Install and configure both the Ethernet switches and the Infiniband.
   * *Notes*:
      * Infiniband: There will be 2 root switches (those with the Ethernet Bridge: 4036E) at the top of Rack 9, and 5 leaf switches (the older one goes to Rack 4). Three uplinks go from every leaf switch to each root switch (using ports 31-36 on each leaf). %ICON{done}%
      * Ethernet: We will configure a stack of 8 Force10 switches, each with ports 1-24 on VLAN_64 and ports 25-48 on VLAN_ILOM. There will be one 10G uplink using the XFP port behind the switch closest to the Cisco fabric extender. If possible, the stack should be connected in a ring topology. %ICON{done}%

---++ Relocate / Start Service Nodes %ICON{done}%
   * *Description*: Move Service Nodes from the old racks to the new locations, start them, *remove the VNIC*, and check status.
   * *Affected nodes*:
      * puppet (+MDT) %ICON{done}%
      * blackbox %ICON{done}%
      * nfs01 %ICON{done}%
      * nfs02 %ICON{done}%
      * kvm01 + 10GbE card + eth1 status + guests (change ethernet cards to virtio) %ICON{done}%
      * xen11 + guests %ICON{done}%
      * xen13 + guests %ICON{done}%
      * xen15 + guests %ICON{done}%
      * cream01 %ICON{done}%
      * cream02 %ICON{done}%
   * *Notes*: After preparing the VM physical hosts, we need to start and prepare the virtual hosts.
      * Replace certificates for argus01-02 %ICON{done}%
      * Start the BDIIs, ui64 and pub, at least. %ICON{done}%

---++ Relocate / Start Storage Nodes %ICON{done}%
   * *Description*: Move Storage Nodes from the old racks to the new locations, start them, *remove the VNIC*, and check status.
   * *Affected nodes*:
      * storage01 %ICON{done}%
      * storage02 %ICON{done}%
      * ibm[01-04] controller + enclosures %ICON{done}%
      * se[30-39] %ICON{done}%
      * se[01-04] - Renamed from se[40-43] %ICON{done}%
      * se[05-08] %ICON{done}%
   * *Notes*: Afterwards we need to do the following:
      * The IBM IO Servers need an ILOM firmware upgrade. %ICON{done}%
      * dCache is supposed to work with the 10.10 up. Try and check it. %ICON{done}%
      * Check the service is up, published in the BDII, and being used %ICON{done}%
      * Announce to the VO-Reps that the service is back %ICON{done}%

---++ Relocate / Start Compute Nodes %ICON{done}%
   * *Description*: Move Compute Nodes from the old racks to the new locations, *rename and reinstall* them. %ICON{done}%
   * *Affected nodes*: wn[197-206] renamed to wn[01-10] %ICON{done}%
   * *Notes*: Make sure there is 64 GB Swap on all nodes %ICON{done}%

---++ Relocate / Start Scratch Nodes %ICON{done}%
   * *Description*: Move Scratch Nodes from the old racks to the new locations, start them, *remove the VNIC*, and check status.
   * *Affected nodes*:
      * mds1 %ICON{done}%
      * mds2 %ICON{done}%
      * ost[11-44] %ICON{done}%
      * oss[11-42] %ICON{done}%
   * *Notes*: Once they are ready, we need to rebuild GPFS:
      * Ensure there are no leftover *.ib.lcg.cscs entries in /etc/hosts %ICON{done}%
      * Re-create the cluster using only IB connections, and 4 + 2 failure groups (instead of 8 + 2) %ICON{done}%
      * Copy the skeleton %ICON{done}%
      * Check performance %ICON{done}%

---++ Service Challenge %ICON{done}%
   * *Description*: After all machines are up and configured, we need to fill up the cluster with jobs, copying to/from the SE / Scratch, to check everything is fine.
   * *Notes*: Possibly, if there is time, run Linpack on the nodes to get an estimate of the cluster's overall performance.

*More things to check:*
   * All IB links have the right negotiated speed. %ICON{done}%
   * Everything works fine with the Force10 stack down. %ICON{done}%
   * All nagios checks are OK %ICON{done}%
   * Ganglia shows all graphs %ICON{done}%
   * CMS pilots pool-accounts creation (if possible) %ICON{done}%

*After this, open the queues and send the announcement*
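
The host renames listed in the storage and compute sections (se[40-43] to se[01-04], wn[197-206] to wn[01-10]) could be expanded into an explicit old-to-new list before reinstalling. The following is only a minimal sketch, not a tool we actually used: the helper names =expand= and =rename_map= are hypothetical, and pairing old and new names in ascending order is an assumption, since the plan only gives the ranges.

```python
import re


def expand(pattern):
    """Expand a range like 'wn[197-206]' into ['wn197', ..., 'wn206'].

    Zero-padding follows the width of the first bound, so 'wn[01-10]'
    yields 'wn01' ... 'wn10'.
    """
    match = re.fullmatch(r"(\w+)\[(\d+)-(\d+)\]", pattern)
    if match is None:
        raise ValueError(f"not a host range: {pattern!r}")
    prefix, first, last = match.groups()
    width = len(first)
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def rename_map(old_pattern, new_pattern):
    """Pair old and new hostnames in ascending order (ordering is an assumption)."""
    old, new = expand(old_pattern), expand(new_pattern)
    if len(old) != len(new):
        raise ValueError("old and new ranges differ in size")
    return dict(zip(old, new))


if __name__ == "__main__":
    # Ranges taken from the compute and storage sections of this page.
    for old_range, new_range in [("wn[197-206]", "wn[01-10]"),
                                 ("se[40-43]", "se[01-04]")]:
        for old, new in rename_map(old_range, new_range).items():
            print(f"{old} -> {new}")
```

The size check is there so that a mistyped range fails loudly instead of silently leaving a node without a new name.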
Topic revision: r8 - 2012-05-07 - PabloFernandez
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.