Scheduled Maintenance on 2013-07-03
On the first working Wednesday of next month, 2013-07-03, we will enter scheduled downtime. The window runs from 9:00 to 18:00, but we will return to operation as soon as the work is finished.
As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue 48 hours before.
Remember to add the downtime in GOCDB. Queues will be closed according to the following schedule:
Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlas" | at -m 9am 2.07.13
job 100 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlashimem" | at -m 9am 2.07.13
job 101 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable cms" | at -m 9am 2.07.13
job 102 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable other" | at -m 9am 2.07.13
job 103 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lhcb" | at -m 9am 1.07.13
job 104 at 2013-07-01 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lcgadmin" | at -m 8:30am 3.07.13
job 105 at 2013-07-03 08:30
Jun 14 12:07 [root@lrms02:~]# atq
100 2013-07-02 09:00 a root
102 2013-07-02 09:00 a root
105 2013-07-03 08:30 a root
103 2013-07-02 09:00 a root
101 2013-07-02 09:00 a root
104 2013-07-01 09:00 a root
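To reopen the queues once the downtime ends, the qdisable jobs above have a natural counterpart. A minimal sketch (queue names taken from the schedule above; the commands are only printed here so they can be reviewed before being run on lrms02):

```shell
# Re-enable every queue that was closed for the downtime.
# qenable is the Torque/PBS counterpart of qdisable; we only print
# the commands rather than executing them.
queues="atlas atlashimem cms other lhcb lcgadmin"
for q in $queues; do
  printf 'qenable %s\n' "$q"
done
```

Piping the output through a shell (`| sh`) would apply it for real once the maintenance is over.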
Summary of interventions
We will perform the following operations on the cluster:
Restrict squid
- Description: Restrict squid so that only the RAL servers can be accessed through the squid proxy
- Affected nodes: cvmfs1, cvmfs, wn[01-78]
- Notes: Add the following to squid.conf:
acl ral dst cernvmfs.gridpp.rl.ac.uk
acl ral dst cvmfs.racf.bnl.gov
acl cvmfs dst cvmfs-stratum-one.cern.ch
acl cvmfs dst cernvmfs.gridpp.rl.ac.uk
acl cvmfs dst cvmfs.racf.bnl.gov
acl cvmfs dst cvmfs02.grid.sinica.edu.tw
acl cvmfs dst cvmfs.fnal.gov
acl cvmfs dst cvmfs-atlas-nightlies.cern.ch
And update the http_access rules for localnet:
http_access allow localnet ral
http_access allow localnet cvmfs
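Before reloading squid it is worth checking that the new destinations actually made it into the config. A small helper sketch (the function name and config path are our own, not part of the change):

```shell
# check_acls CONF HOST... - verify each HOST appears as a "dst" ACL
# destination in CONF; report the first missing host and fail, or
# confirm that all are present.
check_acls() {
  conf="$1"; shift
  for host in "$@"; do
    grep -q "dst $host" "$conf" || { echo "missing: $host"; return 1; }
  done
  echo "all ACL destinations present"
}
```

For example, `check_acls /etc/squid/squid.conf cvmfs-stratum-one.cern.ch cvmfs.fnal.gov`, followed by `squid -k reconfigure` to apply the change.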
Update worker nodes to SL6
- Description: Worker nodes will be updated to SL6
- Affected nodes: All worker nodes
- Notes: With this update we can use the OFED stack bundled with SL6 and drop Mellanox OFED from the install process. The install process will also be refined to use internal repositories, keeping reboots during provisioning to a minimum. mcelog will also be installed to monitor for memory errors.
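The mcelog installation can be folded into the provisioning itself. A minimal kickstart fragment as a sketch (assuming mcelog is available from the internal repositories):

```
%packages
mcelog
%end

%post
# Start mcelog at boot so memory errors are logged from the first boot.
chkconfig mcelog on
%end
```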
Restart BDII services
- Description: Restart the BDII services to ensure we are publishing the correct information.
Update cvmfs
- Description: In SL6, cvmfs needs to be updated to 2.1
- Affected nodes: All worker nodes
- Notes: We also have to mount cvmfs in RW mode. Consult web-rt ticket #13573.
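For reference, the 2.1 client is configured through /etc/cvmfs/default.local. A sketch (the repository list, proxy host, and quota below are illustrative assumptions, not the production values):

```
CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,lhcb.cern.ch
CVMFS_HTTP_PROXY="http://cvmfs1:3128"    # assumed local squid proxy
CVMFS_QUOTA_LIMIT=20000                  # local cache quota in MB
```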
Restart pbs and dcache services
- Description: After the DNS change we need to restart services querying old systems.
- Affected nodes: se[01-14], storage0[1,2] and lrms0[1-2]
- Notes: Check ticket #13546
Decommission KVM01
- Description: Remaining VMs are to be moved from this host so that KVM01 can be decommissioned
- Affected nodes: Pub, UI64, ppcvmfs
- Notes:
- pub is still at 5.4; reinstall it with 6.4
- ui has been installed on KVM03 and will replace ui64
- ppcvmfs is to be moved to the pre-production KVM host
Decommission old voboxes
- Description: Old voboxes need to be decommissioned.
- Affected nodes: cmsvobox and atlasvobox
- Notes: atlasvobox can be shut down, but NOT cmsvobox (it has been moved to a KVM VM until CMS is ready). The atlasvobox VM disks have been moved to /kvm02/.
Migrate lrms02 to kvm
- Description: Right now lrms02 is still a Xen VM; it needs to be migrated to KVM.
- Affected nodes:
lrms02
- Notes: Check the process followed in the previous maintenance.
Update kernels of SL6 machines
- Description: CVE-2013-2094 allows privilege escalation from a standard user to root
- Affected nodes:
ui
logstash
(NO) Storage01, Storage02
(NO) Cream01, Cream02, Cream03
SBDII01, SBDII02, SBDII03
APEL
(NO) KVM02, KVM03
- Notes: Machines are not user facing. Nodes marked (NO) are excluded from this kernel update.
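A quick way to verify a node after the update is to compare its running kernel with the first fixed version. A sketch (the fixed version 2.6.32-358.6.2.el6 should be confirmed against the vendor advisory for CVE-2013-2094 before relying on it):

```shell
# kernel_at_least CURRENT MINIMUM - succeed if CURRENT sorts at or
# above MINIMUM under version ordering (GNU sort -V).
kernel_at_least() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

FIXED="2.6.32-358.6.2.el6"
if kernel_at_least "$(uname -r)" "$FIXED"; then
  echo "running kernel is at or above $FIXED"
else
  echo "running kernel is OLDER than $FIXED - update and reboot needed"
fi
```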
Update CREAM-CE to latest release
- Description: Update all CREAM-CEs to the latest UMD-2 release.
- Affected nodes: cream01, cream02, cream03
- Notes: YAIM also needs to be run after the update.
Update ntp servers
Expand dCache monitoring
- Description: Add monitoring tools to gain better awareness of what is happening within dCache
- Affected nodes: storage01.lcg.cscs.ch
- Notes: Enable the dCache statistics module and install srmwatch.
Details for enabling statistics:
http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-basic-fhs.shtml
http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-webPage-fhs.shtml
SRM Watch:
http://www.dcache.org/manuals/Book-1.9.5/config/cf-srm-monitor.shtml
Example running at FNAL:
http://cmsdcam3.fnal.gov:8081/srmwatch/
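Per the Book chapter linked above, enabling the statistics module amounts to adding a statistics cell to the layout file. Roughly (the domain name here is our choice, and the exact syntax should be checked against the 2.2 Book):

```
[statisticsDomain]
[statisticsDomain/statistics]
```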
Fix errors found in dCache
- Description: There is an incorrect path in the LinkGroupAuthorization file setting, and the dCache servers require fetch-crl
- Affected nodes: storage01.lcg.cscs.ch, storage02.lcg.cscs.ch and all se machines
- Notes: Whilst troubleshooting dCache issues some errors were found.
LinkGroupAuthorization.conf is in /etc/dcache, not /opt/d-cache/config/:
Jun 27 14:31 [root@nfs02:DCACHE22]# grep opt dcache.* | grep -v port
dcache.conf:# Refer to /usr/share/dcache/defaults/dcache.properties for further options
dcache.conf.pools.sepools3_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools3_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
dcache.conf.pools.sepools4_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools4_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
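The stale paths in the pool configs above can be rewritten in place. A sketch (file names follow the transcript; the helper name is our own, and sed keeps a .bak backup of each file it touches):

```shell
# fix_lga_path FILE - point SpaceManagerLinkGroupAuthorizationFileName
# at the new location (/etc/dcache) instead of the old /opt/d-cache
# tree; a FILE.bak backup is kept.
fix_lga_path() {
  sed -i.bak \
    's|/opt/d-cache/etc/LinkGroupAuthorization.conf|/etc/dcache/LinkGroupAuthorization.conf|' \
    "$1"
}
```

For example, `fix_lga_path dcache.conf.pools.sepools3_22` on nfs02, repeated for each affected pool config.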
Machines need fetch-crl installed and its cron job enabled, as there is currently no vomsdir under /etc/grid-security/.