
CSCS-LCG2 Tier2 migration, PhaseB to PhaseC


This document describes the technical steps for the CSCS-LCG2 Tier2 migration from PhaseB to PhaseC, along with the related timeline.


  • Inform users of the migration plan - target: 28/04/2010
  • Schedule a downtime window from Monday 03/05/2010 at 08:00 to Thursday 06/05/2010 at 12:00 (three+ days).
  • Execute the "downtime" script on ce01 in advance to shut off the queues in time
  • PhaseC xen hosts in "production-ready" state: xen11, xen12, xen13, xen14, xen15
  • PhaseC xen virtual machines in "production-ready" state, controlled by cfengine: CE*s, SE*s

Queue closure on PhaseB. Friday 30/04/2010 at 15:00 Swiss time

  • No new jobs will be accepted from that moment on, but running jobs will continue to completion.
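The queue closure can be scripted ahead of time. A minimal sketch, assuming a Torque/PBS batch system on ce01 and illustrative per-VO queue names (the real queue list will differ):

```shell
# Illustrative queue names -- substitute the real queues defined on ce01.
QUEUES="atlas cms lhcb ops dteam"

# Disabling a queue refuses new submissions but leaves queued and
# running jobs untouched, so the batch system drains naturally.
for q in $QUEUES; do
    printf 'set queue %s enabled = false\n' "$q"
done > close_queues.qmgr

cat close_queues.qmgr
# On the CE, apply with:  qmgr < close_queues.qmgr
```

Generating the directives into a file first makes it easy to review them before feeding them to qmgr.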

Decommission PhaseB: preparation on Friday and Monday

  • Step 1: check that all computing queues are stopped/empty
  • Step 2: check for any pending transfers and stop dCache
  • Step 3: firewall off the local storage services (storage01, storage02 & ui)
  • Step 4: prepare the system for roll-back; ensure backups are present, etc.
  • Step 5: prepare the Lustre mounts on all PhaseC nodes and VMs; ensure PhaseC is in a good state before work starts.
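Step 3 can be done with host firewall rules on each of the three nodes. A hedged iptables sketch (the subnet below is illustrative; substitute the real CSCS/management ranges before use):

```shell
# Keep already-established connections, loopback and the local LAN;
# drop all other inbound traffic. Subnet is illustrative.
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -s 148.187.0.0/16 -j ACCEPT
iptables -P INPUT DROP
```

This blocks external access while leaving internal administration of storage01, storage02 and the ui possible.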

Storage. Monday 03/05/2010 [PF]


  • Step 1: Ensure the firewall is closed to external access. Upgrade dCache to the latest version and do at least one restart cycle (this forces cleanup of deleted and partially transferred files).
  • Step 2: Run the dcache-regression tests (including scripts from Derek and Pablo). Check that everything looks sane, then stop the dCache service on all PhaseB nodes.

Then the core nodes will be switched from PhaseB machines to PhaseC. A new dCache instance will already be set up on the new machines with the config files in place, so they only need to be renamed (and re-IPed) to take over from the old ones. The procedure, in order:

  • Step 3: Make two database dumps from PhaseB storage02: dcache and pnfs.
  • Step 4: Start pnfs on PhaseB and migrate Chimera to PhaseC (leaving pnfs untouched).
  • Step 5: Restore the dcache database on PhaseC storage01 and point the dCacheConfig file to the local database (this also requires moving the PinManager to storage01).
  • Step 6: Shut down the network on the PhaseB core nodes and exchange their names with the PhaseC core nodes.
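Steps 3 and 5 are plain PostgreSQL dump/restore operations. A sketch, assuming PostgreSQL databases named dcache and pnfs as in the text (user, host and dump file names are illustrative):

```shell
# On PhaseB storage02: dump both databases. Custom format (-Fc) is
# compressed and allows selective restore with pg_restore.
pg_dump -U postgres -Fc dcache > /root/dcache.dump
pg_dump -U postgres -Fc pnfs   > /root/pnfs.dump

# On PhaseC storage01: recreate and restore the dcache database.
createdb -U postgres dcache
pg_restore -U postgres -d dcache /root/dcache.dump
```

Keep the dump files around for the roll-back scenario as well.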

At this point we start the Chimera and dCache services and test their status (still not open to outside access).

After dCache core services are checked, we start with the Thors.

  • Step 7: Set all HEP VOs' Thumper pools READONLY in the PoolManager, and all Thor pools READWRITE. Test from the ATLAS VO (Fotis or Pablo) and from CMS (Derek or Leo).
  • Step 8: Run the dcache-regression tests again. We open dCache for public access once we are confident in its status.
  • Step 9: (TBD, post-maintenance-window!) Finally, once we have been running fine for some days, re-enable RW on the Thumpers and perform a consistency check with Derek's tools (file deletions may have been issued but not finalized because of the RO state). The check covers both files in pools that are not in pnfs, and files in pnfs that are not in pools.
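The Step 9 consistency check is essentially two set differences. A toy sketch with `comm` (the pnfsids and file names are illustrative; the real lists would come from the pool inventories and a namespace dump):

```shell
# Toy inventories, one pnfsid per line (comm requires sorted input).
printf '0001\n0002\n0003\n' > pools.txt   # what the pools report on disk
printf '0002\n0003\n0004\n' > pnfs.txt    # what the namespace knows about

comm -23 pools.txt pnfs.txt > orphans.txt # in pools but not in pnfs
comm -13 pools.txt pnfs.txt > lost.txt    # in pnfs but not in pools

cat orphans.txt   # → 0001
cat lost.txt      # → 0004
```

Orphans are candidates for cleanup (deletions issued while the pools were RO); lost files need investigation before the namespace entry is removed.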

The Thumpers will simply be upgraded to the newest dCache version. Their config files remain unchanged, since the core lm service keeps the same name.

Computing. Monday 03/05/2010 [PO]


  • move xenU's to PhaseC
    • DONE lrms02
    • DONE ce02 / ce11
    • arc02
  • DONE integrate Lustre on CE's
  • rsync experiment software and software VO tags to lustre
  • add CE's to site-BDII (production queues disabled) / add to GOC DB


  • Step 0: stop queues/services and put a firewall rule in place if needed (declare the downtime in the GOC DB)
  • Step 1: rsync the experiment software and software VO tags to Lustre
  • Step 2: enable the production queues with a reduced number of WNs
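The software rsync in Step 1 is a straightforward mirror copy. A sketch, assuming the VO software currently lives under /opt/exp_soft and the new Lustre area is mounted at /lustre/exp_soft (both paths illustrative):

```shell
SRC=/opt/exp_soft/
DST=/lustre/exp_soft/

# -a preserves permissions/ownership/times, -H keeps hard links,
# --delete makes the Lustre copy an exact mirror. Preview first.
rsync -aH --delete --dry-run "$SRC" "$DST"
# rsync -aH --delete "$SRC" "$DST"   # real copy once the preview looks sane
```

The trailing slashes matter: they copy the contents of SRC into DST rather than creating an extra directory level.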

After storage is production-ready:

  • Step 3: recreate CE services from PhaseB on PhaseC
    • ce01
    • arc01

Other services. Monday 03/05/2010 [ALL]

  • VO-Boxes
    • CMS: Migration by Derek/Leo
    • ATLAS: Migration by CSCS


  • Ganglia
    • DONE Service Nodes
    • DONE Worker Nodes
    • DONE File Server
  • Nagios


  • Revise and cleanup firewall rules according to the new IP addresses and services
  • Remember to verify via SAM testing that the new rules do not break any service (during the next steps)

Testing and experimental boot-up sequence. Tuesday 04/05/2010 at 15:00 local time

  • Ensure all services to be tested are now up
  • Rerun the dcache-regression test and the local queue tests (/home/fotis/bin/CE_testqueue), globus-job-run ce01 /bin/hostname, glite-wms-job-list-match; observe and verify the results.
  • Open the firewall rules and open the ops/dteam/dech queues.
  • Let the VO contacts interactively run an application on the UIs and WNs to test storage. We will coordinate with the VO contacts for this purpose, but this will not happen before Friday afternoon.
  • Let the SAM tests run; check the Nagios instances, Ganglia plots, etc.
  • If all looks OK, gradually open all queues while still within the downtime window, and observe the outcome.
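The job-path checks above reduce to a few standard commands. A sketch (the test-queue script path is from the text; the CE host short name is used as written, and the JDL file name is illustrative):

```shell
# Site-local batch sanity check (script path from the text).
/home/fotis/bin/CE_testqueue

# Direct Globus submission to the CE: should print a WN hostname.
globus-job-run ce01 /bin/hostname

# WMS matchmaking: verify the new CEs are published and matchable
# for a simple job description (test.jdl is illustrative).
glite-wms-job-list-match -a test.jdl
```

Each command exercises a different layer: the local batch system, the Globus gatekeeper, and the information system as seen by the WMS.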

Fall-back scenarios for 06-07/05/2010 (stopgap)

  • If all fails, fall back to PhaseB until we get a better understanding of the issues.
  • For storage, in case of rollback, we will copy all new files on the Thors out of dCache with their full paths, and restore them to PhaseB via gridftp commands once it is back online. There are several ways to identify those files and their paths; most likely this will be done through the ssh admin interface.
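Once the file list is known, the restore half of the rollback can be driven with globus-url-copy. A hedged sketch (the list file, the rescue staging directory, the door host and the pnfs prefix are all illustrative):

```shell
# files.txt: one namespace path per line, gathered e.g. via the ssh
# admin interface (hypothetical file; all paths illustrative).
while read -r f; do
    globus-url-copy \
        "file:///rescue$f" \
        "gsiftp://storage01/pnfs$f"
done < files.txt
```

Writing back through a gridftp door rather than copying into the pools directly keeps the namespace and pool contents consistent.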

On Thursday 06/05/2010 at 12:00 UTC we aim to have a production system up.

-- PabloFernandez - 2010-04-28

Topic revision: r24 - 2011-02-14 - PabloFernandez