Scheduled Maintenance on 2011-06-08
Next Wednesday we will go into scheduled downtime. The window runs from 9:00 to 18:00, but we will return to operation as soon as we finish.
As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before it.
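The closing times follow directly from the 9:00 start of the window. A minimal sketch of the arithmetic, assuming GNU date is available; close_time is a hypothetical helper name, and times are handled as UTC only to keep the subtraction free of time-zone effects:

```shell
# Sketch: compute a queue closing time relative to the maintenance start.
# Assumes GNU date (-d and @epoch syntax). close_time is a hypothetical helper.
close_time() {
    local start="$1" hours_before="$2" start_epoch
    start_epoch=$(date -u -d "$start" +%s) || return 1
    date -u -d "@$((start_epoch - hours_before * 3600))" '+%Y-%m-%d %H:%M'
}

close_time "2011-06-08 09:00" 24   # CMS and ATLAS queues: 2011-06-07 09:00
close_time "2011-06-08 09:00" 48   # LHCb queue:           2011-06-06 09:00
```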
Summary of interventions
We will perform the following operations on the cluster:
- Torque upgrade to 2.4.13
- Upgrade WN
- NFS DRBD fix
- Thor/Thumper firmware update
- Xen14/15 cfengine integration
- Enable glexec capability on CreamCE
- Installation of EMI1 CREAM-CE in cream02
- Apply Argus policies script from-groupmap-to-policy.sh
Torque upgrade to 2.4.13
- Description: Torque has been running without HA for almost two months because of instabilities in our current version (2.4.11). The Torque support team has been working on this, and the only way to move the ticket forward is to upgrade to 2.4.13. This may not solve the issue, but the only way to know is to try.
- Affected nodes: All computing nodes: Cream, Arc, Lrms, WNs
Upgrade WN
- Description: There is a set of WNs [141-195] not yet upgraded to SL5.5. We will do that now.
- Affected nodes: wn141-wn195
dsh -w wn[141-195] 'rpm -U http://ftp.scientificlinux.org/linux/scientific/55/x86_64/SL/yum-conf-55-1.SL.noarch.rpm'
dsh -w wn[141-195] 'rpm -e srptools-debuginfo-0.0.4-1.ofed1.4.1.1.1.2 srptools-0.0.4-6.el5'
dsh -w wn[141-195] 'yum clean all'
dsh -w wn[141-195] 'yum update -y'
dsh -w wn[141-195] 'rpm -e perl-XML-LibXML-1.70-2.el5.rf.x86_64'
dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --enablerepo=cscs,glite*,dag,atlas* -y'
dsh -w wn[141-195] 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c
dsh -w wn[141-195] 'rpm_clone -r -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude -y'
dsh -g WN 'rpm_clone -d -f /opt/cscs/etc/pgklist/pkglist_WN_sun --ignore-exclude' | dshbak -c
- Side note: During the maintenance, a new gLite release was out for WNs. We have also installed it.
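After the update it can be useful to confirm which nodes, if any, still report an older release. A minimal sketch, assuming the input has one "node: release string" line per node (the prefix format that dshbak consumes) and that an upgraded node's release string contains "release 5.5"; not_upgraded is a hypothetical helper name:

```shell
# Sketch: print the nodes whose reported release is not SL 5.5.
# Input: lines of the form "<node>: <release string>" on stdin.
# Lines without a ": " separator are ignored.
not_upgraded() {
    awk -F': ' 'NF >= 2 && $2 !~ /release 5\.5/ { print $1 }'
}

# Possible usage on the cluster (not run here, shown as an assumption):
#   dsh -w wn[141-195] 'cat /etc/redhat-release' | not_upgraded
```

An empty result would mean every node that answered reports 5.5; nodes that did not answer at all are not detected by this check.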
NFS DRBD fix
- Description: There was a mistake in the DRBD configuration on the NFS servers that could prevent the secondary from becoming primary in a failover situation.
- Affected nodes: nfs01/nfs02, and all computing nodes: Cream, Arc, Lrms, WNs
Thor/Thumper firmware update
- Description: The disk controller on Thors and Thumpers suffers from sporadic freeze-ups. This affects about one node every month; a firmware upgrade will hopefully solve it, although that is not guaranteed.
- Affected nodes: All dCache nodes. dCache service will be interrupted.
- Side note: During the maintenance, we also upgraded dCache to the latest bugfix version.
Xen14/15 cfengine integration
- Description: xen14 and xen15 are not controlled by cfengine. This is a long-standing problem that we will try to solve now.
- Affected nodes: xen14, xen15, and all VMs inside: Argus, Lrms, Cream, Arc.
Installation of EMI1 CREAM-CE in cream02
- Description: Since we are early adopters of EMI CREAM-CE, we need to upgrade one of the cream servers to use the new EMI1 release.
- Affected nodes: cream02
- Process:
- Shut down cream02 and make a backup of it:
dd if=/dev/vg_root/cream02_root bs=1M | gzip | ssh miguelgi@ui64 "dd of=./bck.cream02.before.EMI1.img.gzip bs=1M"
- cfengine files to modify in the CREAM directory (make sure the permissions of copied files and dirs are correct):
- /etc/yum.repos.d/glite-CREAM.repo --> move from CREAM to ppcream01
- /etc/yum.repos.d/glite-TORQUE_utils.repo --> move from CREAM to ppcream01
- /usr/share/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl --> move from CREAM to ppcream01
- /etc/sudoers --> move from CREAM to ppcream01
- /etc/profile.d/CSCS.sh --> diff against the one on a regular machine, then move from CREAM to ppcream01
- cfengine files to copy from the cfengine files of ppcream02 to the CREAM directory (make sure the permissions of copied files and dirs are correct):
- /etc/sudoers
- /etc/profile.d/CSCS.sh
- /var/lib/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl
- cfengine files/directories to copy from the PPCREAM_CE directory to the CREAM_CE directory (make sure the permissions of copied files and dirs are correct):
- /etc/grid-security/gridmapdir
- /etc/pki/rpm-gpg
- /lustre/scratch
- /opt/edg/var/info/
- /var/lib/tomcat5/webapps
- /var/log/cream
- /var/log/tomcat5
- /var/spool/pbs/server_priv/accounting
- /var/spool/pbs/server_priv/server_name MODIFY ACCORDING TO PRODUCTION CREAM!!!
- cfengine files to be MERGED from the PPS directory into the ANY directory:
- /srv/cfengine/files/PPS/opt/cscs/siteinfo/nodes/ppcream02.lcg.cscs.ch
- /srv/cfengine/files/ANY/opt/cscs/siteinfo/nodes/cream02.lcg.cscs.ch
- cfengine inputs files to be MERGED:
- cf.CREAM_CE
- cf.PPCREAM_CE
- Install the machine following instructions on https://wiki.chipp.ch/twiki/bin/view/LCGTier2/XenSampleImageReplication and https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceConfiguration
- Once the machine is installed, install the CREAMCE service: https://wiki.chipp.ch/twiki/bin/view/LCGTier2/ServiceCreamCE
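The backup taken in the first step of the process above is only useful if the compressed image is intact. A minimal sketch of one way to check that, assuming sha256sum and gzip are available and that both the source and the archive are readable from the same machine; verify_backup is a hypothetical helper, not part of the original procedure:

```shell
# Sketch: check that a gzip-compressed image is intact and matches its source.
# verify_backup SRC ARCHIVE returns 0 only if the archive passes gzip's
# integrity test and its decompressed content has the same checksum as SRC.
verify_backup() {
    local src="$1" archive="$2" src_sum bck_sum
    gzip -t "$archive" || return 1
    src_sum=$(sha256sum < "$src" | cut -d' ' -f1)
    bck_sum=$(gzip -dc "$archive" | sha256sum | cut -d' ' -f1)
    [ "$src_sum" = "$bck_sum" ]
}

# Possible usage (paths are placeholders):
#   verify_backup /path/to/source.img /path/to/backup.img.gz && echo "backup OK"
```

When the source volume and the archive live on different hosts, as in the dd command above, the two checksums would have to be computed on each side and compared by hand.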
Enable glexec capability on CreamCE
THIS STEP DEPENDS ON PREVIOUS STEPS: Installation of EMI1 CREAM-CE
- Description: To be able to use glExec from outside, this capability has to be announced through CreamCE.
- Also, a few RPMs must be installed and YAIM must be run in the remaining WNs (wn141-wn195)
- Affected nodes: cream01, cream02, WNs (wn141-wn195)
- Process:
- For installing glExec in the WNs follow the instructions in the TWiki
- To enable the glexec capability in the CREAMs, modify /srv/cfengine/files/ANY/opt/cscs/siteinfo/site-info.def according to PPS:
CE_CAPABILITY="CPUScalingReferenceSI00=2500 Share=atlas:40 Share=cms:40 Share=lhcb:20 glexec"
Some extra information: https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#site_info_def
- Run cfengine and yaim in the affected CREAMs.
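Before running yaim it may be worth confirming that the edited site-info.def really announces glexec. A minimal sketch, assuming the file can be sourced as plain shell variable assignments; has_glexec is a hypothetical helper name:

```shell
# Sketch: succeed if CE_CAPABILITY in the given site-info.def lists "glexec".
# The file is sourced in a subshell so its variables do not leak into the
# calling shell. Assumes the file contains plain shell assignments.
has_glexec() {
    (
        . "$1" || exit 2
        case " $CE_CAPABILITY " in
            *" glexec "*) exit 0 ;;
            *) exit 1 ;;
        esac
    )
}

# Possible usage:
#   has_glexec /srv/cfengine/files/ANY/opt/cscs/siteinfo/site-info.def \
#       && echo "glexec is announced"
```

Matching on a space-delimited word avoids false positives from values that merely contain the string glexec.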
Apply Argus policies script from-groupmap-to-policy.sh
- Description: There is a script that automatically maps the policies in the groupmap file to Argus policies. It needs to be deployed.
- Affected nodes: argus01, argus02
- Process:
- Apply the policies script (from-groupmap-to-policy.sh) to argus01, making a diff with the local policies
- Apply the same script to argus02
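The diff against the local policies can be sketched as a comparison of two plain-text dumps. This assumes the current policies and the script's output have each been saved to a file beforehand; how those files are produced is not covered here, and policy_changes is a hypothetical helper name:

```shell
# Sketch: show what would change between the current and generated policies.
# Prints a unified diff, or "no changes" if the two files are identical.
policy_changes() {
    local current="$1" generated="$2"
    if diff -u "$current" "$generated"; then
        echo "no changes"
    fi
}

# Possible usage (file names are placeholders):
#   policy_changes current-policies.txt generated-policies.txt
```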