
CSCS-LCG2 Tier2 migration, PhaseB to PhaseC


This document describes the technical steps for the CSCS-LCG2 Tier2 migration from PhaseB to PhaseC, along with the related timeline.


  • Inform users of the migration plan - target: 28/04/2010
  • Schedule a downtime window from Monday 03/05/2010 at 08:00 to Thursday 06/05/2010 at 12:00 (three+ days).
  • Execute the "downtime" script on ce01 in advance to shut off the queues in time
  • PhaseC xen hosts in "production-ready" state: xen11, xen12, xen13, xen14, xen15
  • PhaseC xen virtual machines in "production-ready" state, controlled by cfengine: CE*s, SE*s

Queue closure on PhaseB. Friday 30/04/2010 at 15:00 Swiss time

  • No new jobs will be accepted from that moment on, but running jobs will continue to completion.
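The queue closure can be scripted ahead of time. A minimal sketch, assuming a Torque/PBS batch system on ce01 and illustrative per-VO queue names (the real queue list will differ):

```shell
# Illustrative queue names -- substitute the real queues defined on ce01.
QUEUES="atlas cms lhcb ops dteam"

# Disabling a queue refuses new submissions but leaves queued and
# running jobs untouched, so the batch system drains naturally.
for q in $QUEUES; do
    printf 'set queue %s enabled = false\n' "$q"
done > close_queues.qmgr

cat close_queues.qmgr
# On the CE, apply with:  qmgr < close_queues.qmgr
```

Generating the directives into a file first makes it easy to review them before feeding them to qmgr.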

Decommission PhaseB: preparation on Friday and Monday

  • Step 1: check that all computing queues are stopped/empty
  • Step 2: check for any pending transfers and stop dCache
  • Step 3: firewall off the local storage services (storage01, storage02 & ui)
  • Step 4: prepare the system for roll-back; ensure backups are present, etc.
  • Step 5: prepare the Lustre mounts on all PhaseC nodes and VMs; ensure PhaseC is in a good state before work starts.
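Step 3 can be done with host firewall rules on each of the three nodes. A hedged iptables sketch (the subnet below is illustrative; substitute the real CSCS/management ranges before use):

```shell
# Keep already-established connections, loopback and the local LAN;
# drop all other inbound traffic. Subnet is illustrative.
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -s 148.187.0.0/16 -j ACCEPT
iptables -P INPUT DROP
```

This blocks external access while leaving internal administration of storage01, storage02 and the ui possible.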

Storage. Monday 03/05/2010 [PF]


  • Step 1: Ensure the firewall is closed to external access. Upgrade dCache to the latest version and do at least one restart cycle (this forces cleanup of deleted and partially transferred files).
  • Step 2: Run the dcache-regression tests (including scripts from Derek and Pablo). Check that everything looks sane, then stop the dCache service on all PhaseB nodes.

Then the core nodes will be switched from PhaseB machines to PhaseC. A new dCache instance will already be set up on the new machines with the config files in place, so they only need to be renamed (and re-IPed) to take over from the old ones. The procedure, in order:

  • Step 3: Make two database dumps from PhaseB storage02: dcache and pnfs.
  • Step 4: Start pnfs on PhaseB and migrate Chimera to PhaseC (leaving pnfs untouched).
  • Step 5: Restore the dcache database on PhaseC storage01 and point the dCacheConfig file to the local database (this also requires moving the PinManager to storage01).
  • Step 6: Shut down the network on the PhaseB core nodes and exchange their names with the PhaseC core nodes.
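Steps 3 and 5 are plain PostgreSQL dump/restore operations. A sketch, assuming PostgreSQL databases named dcache and pnfs as in the text (user, host and dump file names are illustrative):

```shell
# On PhaseB storage02: dump both databases. Custom format (-Fc) is
# compressed and allows selective restore with pg_restore.
pg_dump -U postgres -Fc dcache > /root/dcache.dump
pg_dump -U postgres -Fc pnfs   > /root/pnfs.dump

# On PhaseC storage01: recreate and restore the dcache database.
createdb -U postgres dcache
pg_restore -U postgres -d dcache /root/dcache.dump
```

Keep the dump files around for the roll-back scenario as well.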

At this point we start the Chimera and dCache services and test their status (still not open to outside access).

After dCache core services are checked, we start with the Thors.

  • Step 7: Set all HEP VOs' Thumper pools READONLY in the PoolManager, and all Thor pools READWRITE. Test from the ATLAS VO (Fotis or Pablo) and from CMS (Derek or Leo).
  • Step 8: Run the dcache-regression tests again. We open dCache for public access once we are confident in its status.
  • Step 9: (TBD, post-maintenance-window!) Finally, once we have been running fine for some days, re-enable RW on the Thumpers and perform a consistency check with Derek's tools (file deletions may have been issued but not finalized because of the RO state). The check covers both files in pools that are not in pnfs, and files in pnfs that are not in pools.
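The Step 9 consistency check is essentially two set differences. A toy sketch with `comm` (the pnfsids and file names are illustrative; the real lists would come from the pool inventories and a namespace dump):

```shell
# Toy inventories, one pnfsid per line (comm requires sorted input).
printf '0001\n0002\n0003\n' > pools.txt   # what the pools report on disk
printf '0002\n0003\n0004\n' > pnfs.txt    # what the namespace knows about

comm -23 pools.txt pnfs.txt > orphans.txt # in pools but not in pnfs
comm -13 pools.txt pnfs.txt > lost.txt    # in pnfs but not in pools

cat orphans.txt   # → 0001
cat lost.txt      # → 0004
```

Orphans are candidates for cleanup (deletions issued while the pools were RO); lost files need investigation before the namespace entry is removed.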

The Thumpers will simply be upgraded to the newest dCache version. Their config files remain unchanged, since the core lm service keeps the same name.

Computing. Monday 03/05/2010 [PO]


  • move xenU's to PhaseC
    • DONE lrms02
    • DONE ce02 / ce11
    • arc02
  • DONE integrate Lustre on CE's
  • rsync experiment software and software VO tags to lustre
  • add CE's to site-BDII (production queues disabled) / add to GOC DB


  • Step 0: stop queues/services and put a firewall rule in place if needed (declare the downtime in the GOC DB)
  • Step 1: rsync the experiment software and software VO tags to Lustre
  • Step 2: enable the production queues with a reduced number of WNs
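The software rsync in Step 1 is a straightforward mirror copy. A sketch, assuming the VO software currently lives under /opt/exp_soft and the new Lustre area is mounted at /lustre/exp_soft (both paths illustrative):

```shell
SRC=/opt/exp_soft/
DST=/lustre/exp_soft/

# -a preserves permissions/ownership/times, -H keeps hard links,
# --delete makes the Lustre copy an exact mirror. Preview first.
rsync -aH --delete --dry-run "$SRC" "$DST"
# rsync -aH --delete "$SRC" "$DST"   # real copy once the preview looks sane
```

The trailing slashes matter: they copy the contents of SRC into DST rather than creating an extra directory level.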

After storage is production-ready:

  • Step 3: recreate CE services from PhaseB on PhaseC
    • ce01
    • arc01

Other services. Monday 03/05/2010 [ALL]

  • VO-Boxes
    • CMS: Migration by Derek/Leo
    • ATLAS: Migration by CSCS


  • Ganglia
    • DONE Service Nodes
    • DONE Worker Nodes
    • DONE File Server
  • Nagios


  • Revise and cleanup firewall rules according to the new IP addresses and services
  • Remember to verify via SAM testing that the new rules do not break any service (during the next steps)

Testing and experimental boot-up sequence. Tuesday 04/05/2010 at 15:00 local time

  • Ensure all services to be tested are now up
  • Rerun the dcache-regression test and the local queue tests (/home/fotis/bin/CE_testqueue), globus-job-run ce01 /bin/hostname, glite-wms-job-list-match; observe and verify the results.
  • Open the firewall rules and open the ops/dteam/dech queues.
  • Let the VO contacts interactively run an application on the UIs and WNs to test storage. We will coordinate with the VO contacts for this purpose, but this will not happen before Friday afternoon.
  • Let the SAM tests run; check the Nagios instances, Ganglia plots, etc.
  • If all looks OK, gradually open all queues while still within the downtime window, and observe the outcome.
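The job-path checks above reduce to a few standard commands. A sketch (the test-queue script path is from the text; the CE host short name is used as written, and the JDL file name is illustrative):

```shell
# Site-local batch sanity check (script path from the text).
/home/fotis/bin/CE_testqueue

# Direct Globus submission to the CE: should print a WN hostname.
globus-job-run ce01 /bin/hostname

# WMS matchmaking: verify the new CEs are published and matchable
# for a simple job description (test.jdl is illustrative).
glite-wms-job-list-match -a test.jdl
```

Each command exercises a different layer: the local batch system, the Globus gatekeeper, and the information system as seen by the WMS.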

Fall-back scenarios for 06-07/05/2010 (stopgap)

  • If all fails, fall back to PhaseB until we get a better understanding of the issues.
  • For storage, in case of rollback, we will copy all new files on the Thors out of dCache with their full paths, and restore them to PhaseB via gridftp commands once it is back online. There are several ways to identify those files and their paths; most likely this will be done through the ssh admin interface.
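Once the file list is known, the restore half of the rollback can be driven with globus-url-copy. A hedged sketch (the list file, the rescue staging directory, the door host and the pnfs prefix are all illustrative):

```shell
# files.txt: one namespace path per line, gathered e.g. via the ssh
# admin interface (hypothetical file; all paths illustrative).
while read -r f; do
    globus-url-copy \
        "file:///rescue$f" \
        "gsiftp://storage01/pnfs$f"
done < files.txt
```

Writing back through a gridftp door rather than copying into the pools directly keeps the namespace and pool contents consistent.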

On Thursday 06/05/2010 at 12:00 UTC we aim to have a production system up.

-- PabloFernandez - 2010-04-28

Topic revision: r24 - 2011-02-14 - PabloFernandez