Scheduled Maintenance on 2011-08-03

On the first working Wednesday of the month, 2011-08-03, we will go into scheduled downtime. The window runs from 8:00 to 18:00, but we will return to operation as soon as we finish.

The main goal of this downtime is to converge on the EMI1 version of the middleware, which brings compatibility with both the NorduGrid and gLite stacks we use, ARGUS integration, and other bugfixes. See details below.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before it.
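
The closing times implied by these lead times can be worked out with a short script. This is only a sketch: it assumes the lead times are counted from the 8:00 start of the window in local site time, and the queue names used as dictionary keys are illustrative labels, not the actual batch queue names.

```python
from datetime import datetime, timedelta

# Start of the maintenance window, taken from the announcement.
# Assumption: local site time, and lead times are counted from this instant.
MAINTENANCE_START = datetime(2011, 8, 3, 8, 0)

# Hours before the maintenance at which each queue closes.
# The keys are illustrative labels, not real queue names.
QUEUE_CLOSE_LEAD_HOURS = {"cms": 24, "atlas": 24, "lhcb": 48}


def queue_close_times(start, lead_hours):
    """Return a dict mapping each queue label to the time it closes."""
    return {
        queue: start - timedelta(hours=hours)
        for queue, hours in lead_hours.items()
    }


if __name__ == "__main__":
    times = queue_close_times(MAINTENANCE_START, QUEUE_CLOSE_LEAD_HOURS)
    for queue, closes_at in sorted(times.items()):
        print(f"{queue}: closes {closes_at:%Y-%m-%d %H:%M}")
```

Under these assumptions, the CMS and ATLAS queues would close on 2011-08-02 at 8:00 and the LHCb queue on 2011-08-01 at 8:00.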

Summary of interventions

We will perform the following operations on the cluster:

  • Worker Nodes upgrade to EMI1
  • CreamCEs upgrade to EMI1
  • Argus upgrade to EMI1
  • dCache upgrade to 1.9.5-27
  • Torque upgrade to 2.4.16
  • Scratch FS cleanup


TODO Worker Nodes upgrade to EMI1

  • Description: We were early adopters of this component, and it has been running in production for more than a month. Now we need to apply the change to all worker nodes, which means reinstalling all of them.
  • Affected nodes: all WNs
  • Notes: upgrade postponed to next maintenance, due to problems with EMI1 software

TODO CreamCEs upgrade to EMI1

  • Description: We were early adopters of this component, and it has been running in production for more than a month. Now we need to apply the change to cream01, which requires a reinstall. A newer version with bugfixes is also available for cream02, and we will apply it.
  • Affected nodes: cream01/02
  • Notes: upgrade postponed to next maintenance, due to problems with EMI1 software

TODO Argus upgrade to EMI1

  • Description: We were early adopters of this component, and it has been running in production for more than a month. A newer version with some bugfixes is out, and we want to apply it.
  • Affected nodes: argus01/02
  • Notes: upgrade postponed to next maintenance, due to problems with EMI1 software

DONE dCache upgrade to 1.9.5-27

  • Description: There is a newer version of dCache available, with some bugfixes, that we want to apply. As always, we will try to keep the storage downtime to a minimum, to avoid interruption of jobs at other sites.
  • Affected nodes: storage01/02, all SE Pools
  • Notes: JDK 1.6 was also updated to rev.26. Working since 10:00.

DONE Torque upgrade to 2.4.16

  • Description: Version 2.4.13 has a bug that causes pbs_mom to segfault when the shared filesystem is slow. The new version fixes it.
  • Affected nodes: CreamCEs, ArcCEs, lrms01/02, all WNs.
  • Notes: Still missing: wn137, wn138

DONE Scratch FS cleanup

  • Description: We experienced problems with the Lustre FS recently, and some files inside the /home/egee poolaccounts directory were left in an inconsistent state. We need to refresh it, either by returning Lustre to its initial state or by rebuilding the poolaccount directories.
  • Affected nodes: CreamCEs, ArcCEs, all WNs.
  • Notes: Installed the 1.8.6 Whamcloud build. The poolaccounts were rebuilt by yaim, as were the CREAM sandboxes and the ARC cache/dirs. Failover and stress tests also passed.
Topic revision: r5 - 2011-08-03 - PabloFernandez
 