Scheduled Maintenance on 2011-06-08

Next Wednesday we will go into scheduled downtime. It will last from 9:00 to 18:00 at most; we will return to operation as soon as the work is finished.

As usual, the CMS and Atlas queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before.
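
For reference, a minimal sketch of the queue closure with Torque's qmgr on the LRMS server; the queue names are assumptions and must match our actual queue definitions:

    # 48 hours before the downtime: stop accepting new LHCb jobs
    qmgr -c "set queue lhcb enabled = false"
    # 24 hours before the downtime: same for CMS and Atlas
    qmgr -c "set queue cms enabled = false"
    qmgr -c "set queue atlas enabled = false"
    # Jobs already queued or running continue until they finish or the downtime starts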

_REMEMBER TO ADD THE DOWNTIME IN GOCDB_

Summary of interventions

We will perform the following operations on the cluster:

  • Torque upgrade to 2.4.13
  • Upgrade WN
  • NFS DRBD fix
  • Thor/Thumper firmware update
  • Xen14/15 cfengine integration
  • Installation of EMI1 CREAM-CE in cream02
  • Enable glexec capability on CreamCE
  • Apply Argus policies script from-groupmap-to-policy.sh

DONE Torque upgrade to 2.4.13

  • Description: Torque has been running without HA for almost two months because of instabilities in our current version (2.4.11). The Torque support team has been working on this, and the only way to move forward with the ticket is to upgrade to 2.4.13. This may not solve the issue, but the only way to know is to try (a rough upgrade sketch follows below).

  • Affected nodes: All computing nodes: Cream, Arc, Lrms, WNs
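
A rough sketch of the upgrade itself, assuming the 2.4.13 packages are already in our repositories; package names and service names are illustrative and should be checked against what is actually installed:

    # On the Torque server (LRMS node): stop the daemon, update, restart
    service pbs_server stop
    yum update -y torque torque-server torque-client
    service pbs_server start
    # On the WNs: update the MOM and restart it
    dsh -g WN 'yum update -y torque torque-mom && service pbs_mom restart' | dshbak -c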

DONE Upgrade WN

  • Description: There is a set of WNs [141-195] that has not yet been upgraded to SL 5.5. We will do that now with the commands below; a verification sketch follows them.
  • Affected nodes: wn141-wn195
    dsh -w wn[141-195] 'rpm -U http://ftp.scientificlinux.org/linux/scientific/55/x86_64/SL/yum-conf-55-1.SL.noarch.rpm'
    dsh -w wn[141-195] 'rpm -e srptools-debuginfo-0.0.4-1.ofed1.4.1.1.1.2 srptools-0.0.4-6.el5'
    dsh -w wn[141-195] 'yum clean all'
    dsh -w wn[141-195] 'yum update -y'
    dsh -w wn[141-195] 'rpm -e perl-XML-LibXML-1.70-2.el5.rf.x86_64'
    dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --enablerepo=cscs,glite*,dag,atlas* -y'
    dsh -w wn[141-195] 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c
    dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude -y'
    dsh -g WN 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c
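
After the update, a quick (illustrative) check that every node reports SL 5.5 and a consistent kernel:

    dsh -g WN 'cat /etc/redhat-release' | dshbak -c
    dsh -g WN 'uname -r' | dshbak -c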

NFS DRBD fix

  • Description: There was a mistake in the DRBD configuration on the NFS servers that could prevent the secondary from becoming primary in a failover situation (a fix/verification sketch follows below).
  • Affected nodes: nfs01/nfs02, and all computing nodes: Cream, Arc, Lrms, WNs
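
A minimal sketch of applying and verifying the corrected configuration; the resource name r0 is an assumption, and in production the promotion is normally driven by the cluster manager rather than by hand:

    # On nfs01 and nfs02: reload the corrected resource definition
    drbdadm adjust r0
    # Both nodes should report Connected, Primary/Secondary, UpToDate/UpToDate
    cat /proc/drbd
    # Controlled failover test (with the NFS service stopped on nfs01):
    drbdadm secondary r0    # on nfs01
    drbdadm primary r0      # on nfs02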

Thor/Thumper firmware update

  • Description: The disk controller on the Thors and Thumpers suffers from sporadic freeze-ups. This hits roughly one node per month, and we hope a firmware upgrade will fix it (though there is no guarantee).
  • Affected nodes: All dCache nodes. dCache service will be interrupted.

Xen14/15 cfengine integration

  • Description: xen14 and xen15 are not controlled by cfengine. This is a long-standing problem that we will try to solve now (see the sketch below).
  • Affected nodes: xen14, xen15, and all VMs inside: Argus, Lrms, Cream, Arc.
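
A rough sketch of the first manual runs once the two hosts are registered on the policy server, assuming cfengine 2 as on the rest of the cluster (flags are illustrative):

    # On xen14 and xen15:
    cfagent -q      # first full run against the central inputs
    cfagent -q      # a second run should converge with no further changes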

Installation of EMI1 CREAM-CE in cream02

  • Description: Since we are early adopters of EMI CREAM-CE, we need to upgrade one of the cream servers to use the new EMI1 release.
  • Affected nodes: cream02
  • Process:
    1. Shut down cream02 and make a backup of it
      dd if=/dev/vg_root/cream02_root bs=1M | gzip | ssh miguelgi@ui64 "dd of=./bck.cream02.before.EMI1.img.gzip bs=1M"
    2. cfengine: files to modify in the CREAM directory (make sure the permissions of the copied files and directories are correct):
      • /etc/yum.repos.d/glite-CREAM.repo --> move from CREAM to ppcream01
      • /etc/yum.repos.d/glite-TORQUE_utils.repo -> move from CREAM to ppcream01
      • /usr/share/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl --> move from CREAM to ppcream01
      • /etc/sudoers --> move from CREAM to ppcream01
      • /etc/profile.d/CSCS.sh --> (diff against the one on a regular machine) and move from CREAM to ppcream01
    3. cfengine: files to copy from the ppcream02 cfengine directory to the CREAM directory (make sure the permissions of the copied files and directories are correct):
      • /etc/sudoers
      • /etc/profile.d/CSCS.sh
      • /var/lib/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl
    4. cfengine: files/directories to copy from the PPCREAM_CE directory to the CREAM_CE directory (make sure the permissions of the copied files and directories are correct):
      • /etc/grid-security/gridmapdir
      • /etc/pki/rpm-gpg
      • /lustre/scratch
      • /opt/edg/var/info/
      • /var/lib/tomcat5/webapps
      • /var/log/cream
      • /var/log/tomcat5
      • /var/spool/pbs/server_priv/accounting
      • /var/spool/pbs/server_priv/server_name (MODIFY ACCORDING TO THE PRODUCTION CREAM!)
    5. cfengine: files to be MERGED from the PPS directory into the ANY directory
      • /srv/cfengine/files/PPS/opt/cscs/siteinfo/nodes/ppcream02.lcg.cscs.ch
      • /srv/cfengine/files/ANY/opt/cscs/siteinfo/nodes/cream02.lcg.cscs.ch
    6. cfengine inputs: files to be MERGED
      • cf.CREAM_CE
      • cf.PPCREAM_CE
    7. Install the machine following instructions on https://wiki.chipp.ch/twiki/bin/view/LCGTier2/XenSampleImageReplication and https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceConfiguration
    8. Once the machine is installed, install the CREAM-CE service following https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceCreamCE (a smoke-test sketch follows below)
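
Once cream02 is reinstalled and configured, a possible smoke test from a UI; the queue name and the test JDL are assumptions:

    # Check that the new endpoint accepts submissions
    glite-ce-allowed-submission cream02.lcg.cscs.ch:8443
    # Submit a trivial job (test.jdl containing e.g. Executable = "/bin/hostname";)
    glite-ce-job-submit -a -r cream02.lcg.cscs.ch:8443/cream-pbs-atlas test.jdl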

Enable glexec capability on CreamCE

THIS STEP DEPENDS ON THE PREVIOUS ONE: Installation of EMI1 CREAM-CE

  • Description: To be able to use glExec from outside, this capability has to be announced through the CreamCE.
  • Also, a few RPMs must be installed and YAIM must be run on the remaining WNs (wn141-wn195).
  • Affected nodes: cream01, cream02, WNs (wn141-wn195)
  • Process:
    1. To install glExec on the WNs, follow the instructions in the TWiki
    2. To enable the glexec capability on the CREAMs, modify /srv/cfengine/files/ANY/opt/cscs/siteinfo/site-info.def according to PPS:
      CE_CAPABILITY="CPUScalingReferenceSI00=2500 Share=atlas:40 Share=cms:40 Share=lhcb:20 glexec"
      Some extra information: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#site_info_def
    3. Run cfengine and YAIM on the affected CREAMs (see the sketch below).
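
A minimal sketch of step 3, assuming the standard gLite YAIM layout; the site-info.def path on the node and the node types passed to YAIM are assumptions and should be checked against our service configuration pages:

    # On cream01 and cream02: pull the updated site-info.def, then reconfigure
    cfagent -q
    /opt/glite/yaim/bin/yaim -c -s /opt/cscs/siteinfo/site-info.def -n creamCE -n TORQUE_utils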

DONE Apply Argus policies script from-groupmap-to-policy.sh

  • Description: There is a script, from-groupmap-to-policy.sh, that automatically translates the mappings in the groupmap file into Argus policies, so it needs to be deployed.
  • Affected nodes: argus01, argus02
  • Process:
    1. Apply the policies script (from-groupmap-to-policy.sh) to argus01, diffing against the local policies
    2. Apply the same script to argus02 (then verify as sketched below)
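
After deploying the script, the resulting policies can be inspected with the Argus PAP CLI (a hedged check, run on each Argus node):

    # List the policies now active in the PAP
    pap-admin list-policies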