Scheduled Maintenance on 2013-07-03
On the first working Wednesday of next month, 2013-07-03, we will enter scheduled downtime. The window runs from 9:00 to 18:00, but we will return to operation as soon as the work is finished.
As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue 48 hours before.
Remember to add the downtime in GOCDB. Queues will be closed according to the following schedule:
Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlas" | at -m 9am 2.07.13
job 100 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlashimem" | at -m 9am 2.07.13
job 101 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable cms" | at -m 9am 2.07.13
job 102 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable other" | at -m 9am 2.07.13
job 103 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lhcb" | at -m 9am 1.07.13
job 104 at 2013-07-01 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lcgadmin" | at -m 8:30am 3.07.13
job 105 at 2013-07-03 08:30
Jun 14 12:07 [root@lrms02:~]# atq
100 2013-07-02 09:00 a root
102 2013-07-02 09:00 a root
105 2013-07-03 08:30 a root
103 2013-07-02 09:00 a root
101 2013-07-02 09:00 a root
104 2013-07-01 09:00 a root
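To reopen the queues once the downtime ends, the qdisable jobs above have a natural counterpart. A minimal sketch (queue names taken from the schedule above; the commands are only printed here so they can be reviewed before being run on lrms02):

```shell
# Re-enable every queue that was closed for the downtime.
# qenable is the Torque/PBS counterpart of qdisable; we only print
# the commands rather than executing them.
queues="atlas atlashimem cms other lhcb lcgadmin"
for q in $queues; do
  printf 'qenable %s\n' "$q"
done
```

Piping the output through a shell (`| sh`) would apply it for real once the maintenance is over.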
Summary of interventions
We will perform the following operations on the cluster:
Restrict squid
- Description: Restrict squid so that only the RAL servers can be accessed through the squid proxy
- Affected nodes: cvmfs1, cvmfs, wn[01-78]
- Notes: Add the following to squid.conf:
acl ral dst cernvmfs.gridpp.rl.ac.uk
acl ral dst cvmfs.racf.bnl.gov
acl cvmfs dst cvmfs-stratum-one.cern.ch
acl cvmfs dst cernvmfs.gridpp.rl.ac.uk
acl cvmfs dst cvmfs.racf.bnl.gov
acl cvmfs dst cvmfs02.grid.sinica.edu.tw
acl cvmfs dst cvmfs.fnal.gov
acl cvmfs dst cvmfs-atlas-nightlies.cern.ch
And update the http_access rules for localnet:
http_access allow localnet ral
http_access allow localnet cvmfs
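Before reloading squid it is worth checking that the new destinations actually made it into the config. A small helper sketch (the function name and config path are our own, not part of the change):

```shell
# check_acls CONF HOST... - verify each HOST appears as a "dst" ACL
# destination in CONF; report the first missing host and fail, or
# confirm that all are present.
check_acls() {
  conf="$1"; shift
  for host in "$@"; do
    grep -q "dst $host" "$conf" || { echo "missing: $host"; return 1; }
  done
  echo "all ACL destinations present"
}
```

For example, `check_acls /etc/squid/squid.conf cvmfs-stratum-one.cern.ch cvmfs.fnal.gov`, followed by `squid -k reconfigure` to apply the change.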
Update worker nodes to SL6
- Description: Worker nodes will be updated to SL6
- Affected nodes: All worker nodes
- Notes: With this update we can use the OFED stack bundled with SL6 and drop Mellanox OFED from the install process. The install process will also be refined to use internal repositories, keeping reboots during provisioning to a minimum. mcelog will also be installed to monitor for memory errors.
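The mcelog installation can be folded into the provisioning itself. A minimal kickstart fragment as a sketch (assuming mcelog is available from the internal repositories):

```
%packages
mcelog
%end

%post
# Start mcelog at boot so memory errors are logged from the first boot.
chkconfig mcelog on
%end
```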
Restart BDII services
- Description: Restart the BDII services to ensure we are publishing the correct information.
Update cvmfs
- Description: In SL6, cvmfs needs to be updated to 2.1
- Affected nodes: All worker nodes
- Notes: We also have to mount cvmfs in RW mode. Consult web-rt ticket #13573.
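For reference, the 2.1 client is configured through /etc/cvmfs/default.local. A sketch (the repository list, proxy host, and quota below are illustrative assumptions, not the production values):

```
CVMFS_REPOSITORIES=atlas.cern.ch,cms.cern.ch,lhcb.cern.ch
CVMFS_HTTP_PROXY="http://cvmfs1:3128"    # assumed local squid proxy
CVMFS_QUOTA_LIMIT=20000                  # local cache quota in MB
```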
Restart pbs and dcache services
- Description: After the DNS change we need to restart services querying old systems.
- Affected nodes: se[01-14], storage0[1,2] and lrms0[1-2]
- Notes: Check ticket #13546
Decommission KVM01
- Description: Remaining VMs are to be moved from this host so that KVM01 can be decommissioned
- Affected nodes: Pub, UI64, ppcvmfs
- Notes:
- pub is still at 5.4; reinstall it with 6.4
- ui has been installed on KVM03 and will replace ui64
- ppcvmfs is to be moved to the pre-production KVM host
Decommission old voboxes
- Description: Old voboxes need to be decommissioned.
- Affected nodes: cmsvobox and atlasvobox
- Notes: atlasvobox can be shut down, but NOT cmsvobox (it has been moved to a KVM VM until CMS is ready). The atlasvobox VM disks have been moved to /kvm02/.
Migrate lrms02 to kvm
- Description: Right now lrms02 is still a Xen VM; it needs to be migrated to KVM.
- Affected nodes:
lrms02
- Notes: Check the process followed in the previous maintenance.
Update kernels of SL6 machines
- Description: CVE-2013-2094 allows privilege escalation from a standard user to root
- Affected nodes:
ui
logstash
(NO) Storage01, Storage02
(NO) Cream01, Cream02, Cream03
SBDII01, SBDII02, SBDII03
APEL
(NO) KVM02, KVM03
- Notes: Machines are not user facing. Nodes marked (NO) are excluded from this kernel update.
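A quick way to verify a node after the update is to compare its running kernel with the first fixed version. A sketch (the fixed version 2.6.32-358.6.2.el6 should be confirmed against the vendor advisory for CVE-2013-2094 before relying on it):

```shell
# kernel_at_least CURRENT MINIMUM - succeed if CURRENT sorts at or
# above MINIMUM under version ordering (GNU sort -V).
kernel_at_least() {
  [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

FIXED="2.6.32-358.6.2.el6"
if kernel_at_least "$(uname -r)" "$FIXED"; then
  echo "running kernel is at or above $FIXED"
else
  echo "running kernel is OLDER than $FIXED - update and reboot needed"
fi
```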
Update CREAM-CE to latest release
- Description: Update all CREAM-CEs to the latest UMD-2 release.
- Affected nodes: cream01, cream02, cream03
- Notes: YAIM also needs to be run after the update.
Update ntp servers
Expand dCache monitoring
- Description: Add monitoring tools to gain better awareness of what is happening within dCache
- Affected nodes: storage01.lcg.cscs.ch
- Notes: Enable the dCache statistics module and install srmwatch.
Details for enabling statistics:
http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-basic-fhs.shtml
http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-webPage-fhs.shtml
SRM Watch:
http://www.dcache.org/manuals/Book-1.9.5/config/cf-srm-monitor.shtml
Example running at FNAL:
http://cmsdcam3.fnal.gov:8081/srmwatch/
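Per the Book chapter linked above, enabling the statistics module amounts to adding a statistics cell to the layout file. Roughly (the domain name here is our choice, and the exact syntax should be checked against the 2.2 Book):

```
[statisticsDomain]
[statisticsDomain/statistics]
```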
Fix errors found in dCache
- Description: There is an incorrect path in the LinkGroupAuthorization file setting, and the dCache servers require fetch-crl
- Affected nodes: storage01.lcg.cscs.ch, storage02.lcg.cscs.ch and all se machines
- Notes: Whilst troubleshooting dCache issues some errors were found.
LinkGroupAuthorization.conf is in /etc/dcache, not /opt/d-cache/config/:
Jun 27 14:31 [root@nfs02:DCACHE22]# grep opt dcache.* | grep -v port
dcache.conf:# Refer to /usr/share/dcache/defaults/dcache.properties for further options
dcache.conf.pools.sepools3_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools3_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
dcache.conf.pools.sepools4_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools4_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
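The stale paths in the pool configs above can be rewritten in place. A sketch (file names follow the transcript; the helper name is our own, and sed keeps a .bak backup of each file it touches):

```shell
# fix_lga_path FILE - point SpaceManagerLinkGroupAuthorizationFileName
# at the new location (/etc/dcache) instead of the old /opt/d-cache
# tree; a FILE.bak backup is kept.
fix_lga_path() {
  sed -i.bak \
    's|/opt/d-cache/etc/LinkGroupAuthorization.conf|/etc/dcache/LinkGroupAuthorization.conf|' \
    "$1"
}
```

For example, `fix_lga_path dcache.conf.pools.sepools3_22` on nfs02, repeated for each affected pool config.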
Machines need fetch-crl installed and its cron job enabled, as there is currently no vomsdir under /etc/grid-security/.