
Scheduled Maintenance on 2011-06-08

Next Wednesday we will go into Scheduled Downtime. It will last from 9:00 to 18:00, but we will return to operation as soon as we finish.

As usual, the CMS and Atlas queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before it.
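
The closing times follow directly from the downtime start. A small sketch of the arithmetic, assuming GNU `date` and treating the 9:00 start as UTC purely for illustration (this page does not state a time zone); `queue_close_time` is a hypothetical helper:

```shell
# Hypothetical helper: print when a queue closes, given how many hours
# before the maintenance start it has to close. Requires GNU date.
MAINT_START='2011-06-08 09:00'   # from the announcement; the time zone is an assumption

queue_close_time() {
    hours_before=$1
    # Work in epoch seconds to avoid ambiguous relative date strings.
    start_epoch=$(date -u -d "$MAINT_START" +%s)
    date -u -d "@$((start_epoch - hours_before * 3600))" '+%Y-%m-%d %H:%M'
}

queue_close_time 24   # CMS and Atlas queues
queue_close_time 48   # LHCb queue
```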

Summary of interventions

We will perform the following operations on the cluster:

  • Torque upgrade to 2.4.13
  • Upgrade WN
  • NFS DRBD fix
  • Thor/Thumper firmware update
  • Xen14/15 cfengine integration

  • Enable glexec capability on CreamCE
  • Installation of EMI1 CREAM-CE in cream02
  • Apply Argus policies script from-groupmap-to-policy.sh


DONE Torque upgrade to 2.4.13

  • Description: Torque has been running without HA for almost two months because of instabilities in our current version (2.4.11). The Torque support team has been working on this, and the only way to move the ticket forward is to upgrade to 2.4.13. This may not solve the issue, but the only way to find out is to try.

  • Affected nodes: All computing nodes: Cream, Arc, Lrms, WNs
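
After the upgrade it can be useful to confirm that a node reports at least the target version. A minimal sketch, assuming GNU `sort` with `-V` is available; `version_at_least` is a hypothetical helper, and how the installed version string is obtained from Torque is left to the operator:

```shell
# Hypothetical helper: succeed if the installed version is at least
# the required one, comparing dotted version strings with GNU sort -V.
version_at_least() {
    installed=$1
    required=$2
    lowest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n 1)
    [ "$lowest" = "$required" ]
}

# Example with the versions named above:
if version_at_least "2.4.11" "2.4.13"; then
    echo "2.4.11 is new enough"
else
    echo "2.4.11 is older than 2.4.13"
fi
```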

DONE Upgrade WN

  • Description: There is a set of WNs [141-195] not yet upgraded to SL5.5. We will do that now.
  • Affected nodes: wn141-wn195
    dsh -w wn[141-195] 'rpm -U http://ftp.scientificlinux.org/linux/scientific/55/x86_64/SL/yum-conf-55-1.SL.noarch.rpm'
    dsh -w wn[141-195] 'rpm -e srptools-debuginfo-0.0.4-1.ofed1.4.1.1.1.2 srptools-0.0.4-6.el5'
    dsh -w wn[141-195] 'yum clean all'
    dsh -w wn[141-195] 'yum update -y'
    dsh -w wn[141-195] 'rpm -e perl-XML-LibXML-1.70-2.el5.rf.x86_64'
    dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --enablerepo=cscs,glite*,dag,atlas* -y'
    dsh -w wn[141-195] 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c
    dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude -y'
    dsh -g WN 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c

  • Side note: A new gLite release for WNs came out during the maintenance, and we installed it as well.
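
One way to confirm that the upgrade took effect is to check the release string on each node. A minimal sketch, assuming `/etc/redhat-release` carries the Scientific Linux version; `is_sl55` is a hypothetical helper and was not part of the procedure above:

```shell
# Hypothetical helper: succeed if a release string reports version 5.5.
is_sl55() {
    case "$1" in
        *"release 5.5"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use, run locally on a WN:
#   is_sl55 "$(cat /etc/redhat-release)" || echo "$(hostname): not on SL5.5"
# Or, in the style used above, collect the strings from all nodes at once:
#   dsh -w wn[141-195] 'cat /etc/redhat-release' | dshbak -c
```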

DONE NFS DRBD fix

  • Description: There was a mistake in the DRBD configuration on the NFS servers that could prevent the secondary from becoming primary in a failover situation.
  • Affected nodes: nfs01/nfs02, and all computing nodes: Cream, Arc, Lrms, WNs
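
To check the fix, the role of each node can be inspected after a failover test. A minimal sketch, assuming DRBD 8.x, where `drbdadm role <resource>` prints the roles as `Local/Peer`; `local_drbd_role` is a hypothetical helper and `r0` is a placeholder resource name:

```shell
# Hypothetical helper: extract the local role from "Local/Peer" output.
local_drbd_role() {
    printf '%s\n' "${1%%/*}"
}

# After a failover test, the former secondary should report Primary:
#   role=$(drbdadm role r0)    # "r0" is a placeholder, not the real resource name
#   [ "$(local_drbd_role "$role")" = "Primary" ] || echo "this node was not promoted"
```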

DONE Thor/Thumper firmware update

  • Description: The disk controller on Thors and Thumpers suffers from sporadic freeze-ups. This affects a node every month, and we hope a firmware upgrade will solve it, although that is not guaranteed.
  • Affected nodes: All dCache nodes. dCache service will be interrupted.

  • Side note: During the maintenance, we also upgraded dCache to the latest bugfix version.

DONE Xen14/15 cfengine integration

  • Description: xen14 and xen15 are not controlled by cfengine. This is a long-standing problem that we will try to solve now.
  • Affected nodes: xen14, xen15, and all VMs inside: Argus, Lrms, Cream, Arc.

DONE Installation of EMI1 CREAM-CE in cream02

  • Description: Since we are early adopters of EMI CREAM-CE, we need to upgrade one of the cream servers to use the new EMI1 release.
  • Affected nodes: cream02
  • Process:
    1. Shut down cream02 and make a backup of it:
      dd if=/dev/vg_root/cream02_root bs=1M | gzip | ssh miguelgi@ui64 "dd of=./bck.cream02.before.EMI1.img.gzip bs=1M"
    2. cfengine: files to modify in the CREAM directory (make sure the permissions of copied files and directories are correct):
      • /etc/yum.repos.d/glite-CREAM.repo --> move from CREAM to ppcream01
      • /etc/yum.repos.d/glite-TORQUE_utils.repo --> move from CREAM to ppcream01
      • /usr/share/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl --> move from CREAM to ppcream01
      • /etc/sudoers --> move from CREAM to ppcream01
      • /etc/profile.d/CSCS.sh --> diff against the one on a regular machine, then move from CREAM to ppcream01
    3. cfengine: files to copy from cfengine of ppcream02 to the CREAM directory (make sure the permissions of copied files and directories are correct):
      • /etc/sudoers
      • /etc/profile.d/CSCS.sh
      • /var/lib/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl
    4. cfengine: files/directories to copy from the PPCREAM_CE directory to the CREAM_CE directory (make sure the permissions of copied files and directories are correct):
      • /etc/grid-security/gridmapdir
      • /etc/pki/rpm-gpg
      • /lustre/scratch
      • /opt/edg/var/info/
      • /var/lib/tomcat5/webapps
      • /var/log/cream
      • /var/log/tomcat5
      • /var/spool/pbs/server_priv/accounting
      • /var/spool/pbs/server_priv/server_name MODIFY ACCORDING TO PRODUCTION CREAM!!!
    5. cfengine: files to be MERGED from the PPS directory to the ANY directory:
      • /srv/cfengine/files/PPS/opt/cscs/siteinfo/nodes/ppcream02.lcg.cscs.ch
      • /srv/cfengine/files/ANY/opt/cscs/siteinfo/nodes/cream02.lcg.cscs.ch
    6. cfengine inputs: files to be MERGED:
      • cf.CREAM_CE
      • cf.PPCREAM_CE
    7. Install the machine following instructions on https://wiki.chipp.ch/twiki/bin/view/LCGTier2/XenSampleImageReplication and https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceConfiguration
    8. Once the machine is installed, install the CREAMCE service: https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceCreamCE
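
Steps 2 to 4 each ask that the permissions of copied files and directories be checked. A minimal sketch of one way to do that, assuming GNU `find` (for `-printf`); `perm_listing` and `compare_perms` are hypothetical helpers, and the directories passed to them stand for whichever source and destination trees are being compared:

```shell
# Hypothetical helper: list mode, owner:group and relative path for every
# entry under a directory, sorted by path. Requires GNU find.
perm_listing() {
    (cd "$1" && find . -printf '%m %u:%g %p\n' | sort -k3)
}

# Hypothetical helper: diff the permission listings of two trees.
# Prints nothing and returns 0 when modes and ownership match.
compare_perms() {
    a=$(mktemp) && b=$(mktemp) || return 2
    perm_listing "$1" > "$a"
    perm_listing "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Example use (both paths are placeholders):
#   compare_perms /path/to/source/tree /path/to/copied/tree
```

Entries that exist in only one tree also show up in the diff, so the same check catches files that were missed during the copy.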

DONE Enable glexec capability on CreamCE

THIS STEP DEPENDS ON A PREVIOUS STEP: Installation of EMI1 CREAM-CE

  • Description: To be able to use glExec from outside, this capability has to be announced through CreamCE. In addition, a few RPMs must be installed and YAIM must be run on the remaining WNs (wn141-wn195).
  • Affected nodes: cream01, cream02, WNs (wn141-wn195)
  • Process:
    1. To install glExec on the WNs, follow the instructions in the TWiki.
    2. To enable the glexec capability on the CREAMs, modify /srv/cfengine/files/ANY/opt/cscs/siteinfo/site-info.def to match PPS:
      CE_CAPABILITY="CPUScalingReferenceSI00=2500 Share=atlas:40 Share=cms:40 Share=lhcb:20 glexec"
      Some extra information: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#site_info_def
    3. Run cfengine and YAIM on the affected CREAMs.
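
Before running YAIM, it may be worth confirming that the edited file really announces the capability. A minimal sketch, assuming site-info.def consists of shell variable assignments that are safe to source in a subshell; `has_capability` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if CE_CAPABILITY in the given site-info.def
# contains the given token as a whole word.
# $1: path to site-info.def (use a path containing a slash), $2: token to look for
has_capability() {
    (
        . "$1" || exit 2
        for token in $CE_CAPABILITY; do
            if [ "$token" = "$2" ]; then
                exit 0
            fi
        done
        exit 1
    )
}

# Example use with the file named in step 2:
#   has_capability /srv/cfengine/files/ANY/opt/cscs/siteinfo/site-info.def glexec \
#       || echo "glexec is not announced in CE_CAPABILITY"
```

The file is sourced inside a subshell so that none of its variables leak into the calling shell.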

DONE Apply Argus policies script from-groupmap-to-policy.sh

  • Description: There is a script that automatically maps the policies in the groupmap file to Argus policies. It needs to be deployed.
  • Affected nodes: argus01, argus02
  • Process:
    1. Apply the policies script (from-groupmap-to-policy.sh) to argus01, diffing the result against the local policies
    2. Apply the same script to argus02
Topic revision: r10 - 2011-06-08 - PabloFernandez
 