
Scheduled Maintenance on 2013-07-03

On the first working Wednesday of the month, 2013-07-03, we will go into scheduled downtime. The window runs from 9:00 to 18:00, but we will return to operation as soon as we finish.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before the maintenance.

Summary of interventions

We will perform the following operations on the cluster:


DONE Restrict squid

  • Description: Restrict squid so that only RAL servers can be accessed through the squid proxy
  • Affected nodes: cvmfs1, cvmfs, wn[01-78]
  • Notes: Add the following to squid.conf
    acl ral dst cernvmfs.gridpp.rl.ac.uk
    acl ral dst cvmfs.racf.bnl.gov
    
    acl cvmfs dst cvmfs-stratum-one.cern.ch
    acl cvmfs dst cernvmfs.gridpp.rl.ac.uk
    acl cvmfs dst cvmfs.racf.bnl.gov
    acl cvmfs dst cvmfs02.grid.sinica.edu.tw
    acl cvmfs dst cvmfs.fnal.gov
    acl cvmfs dst cvmfs-atlas-nightlies.cern.ch
    
    And update the http_access rules for localnet:
    http_access allow localnet ral
    http_access allow localnet cvmfs
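
A quick way to double-check the edit is to list which destinations each ACL ends up with. The following is a minimal sketch and not part of the original procedure; the helper name acl_hosts and the config path in the usage comment are assumptions.

```shell
# Print the dst hosts attached to a named ACL in a squid config file.
# $1 = ACL name (e.g. cvmfs), $2 = path to squid.conf
acl_hosts() {
    awk -v name="$1" '$1 == "acl" && $2 == name && $3 == "dst" { print $4 }' "$2"
}

# Hypothetical usage on a proxy node (the path is an assumption):
#   acl_hosts ral   /etc/squid/squid.conf
#   acl_hosts cvmfs /etc/squid/squid.conf
```

Running squid -k parse before reloading is a cheap way to catch syntax errors in the edited file.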
    

DONE Update worker nodes to SL6

  • Description: Worker nodes will be updated to SL6
  • Affected nodes: All worker nodes
  • Notes: With this update we will be able to use the OFED stack bundled in SL6 and remove Mellanox OFED from the install process. The install process will also be refined to use internal repos, and reboots during provisioning are to be kept to a minimum. Also install mcelog to monitor for memory errors.
    Restart the BDII services to ensure we are publishing the correct information.

DONE Update cvmfs

  • Description: In SL6, cvmfs needs to be updated to 2.1
  • Affected nodes: All worker nodes
  • Notes: We also have to mount cvmfs in RW mode. Consult web-rt ticket #13573.

DONE Restart pbs and dcache services

  • Description: After the DNS change we need to restart services querying old systems.
  • Affected nodes: se[01-14], storage0[1,2] and lrms0[1-2]
  • Notes: Check ticket #13546
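
The node ranges above can be expanded into a plain host list before looping over them. This is only a dry-run sketch: it prints the commands instead of running them, and the helper names, the use of ssh, and the service commands in the usage comment are assumptions, not the procedure from ticket #13546.

```shell
# Expand a zero-padded node range, e.g. "se 1 14" -> se01 ... se14.
# $1 = prefix, $2 = first index, $3 = last index (plain decimals, no leading zeros)
expand_range() {
    i=$2
    while [ "$i" -le "$3" ]; do
        printf '%s%02d\n' "$1" "$i"
        i=$((i + 1))
    done
}

# Print (not execute) one restart command per host.
# $1 = command to run remotely, remaining arguments = hostnames
plan_restart() {
    cmd=$1
    shift
    for host in "$@"; do
        echo "ssh $host $cmd"
    done
}

# Hypothetical usage; which service runs on which node is an assumption:
#   plan_restart "dcache restart" $(expand_range se 1 14) $(expand_range storage 1 2)
#   plan_restart "service pbs_server restart" $(expand_range lrms 1 2)
```

Printing the plan first makes it easy to review the host list before anything is restarted.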

DONE Decommission KVM01

  • Description: Remaining VMs are to be moved from this host so that KVM01 can be decommissioned
  • Affected nodes: Pub, UI64, ppcvmfs
  • Notes:
    1. DONE pub is still at 5.4, reinstall with 6.4
    2. DONE ui has been installed on KVM03, this will replace ui64
    3. DONE ppcvmfs to be moved to pre production KVM host.

DONE Decommission old voboxes

  • Description: Old voboxes need to be decommissioned.
  • Affected nodes: cmsvobox and atlasvobox
  • Notes: atlasvobox can be shut down but NOT cmsvobox (it has been moved to a KVM VM until CMS is ready). atlasvobox VM disks have been moved to /kvm02/

DONE Migrate lrms02 to kvm

  • Description: Right now lrms02 is still a Xen VM and needs to be migrated to KVM.
  • Affected nodes: lrms02
  • Notes: Check the process followed in the previous maintenance.

DONE Update kernels of SL6 machines

  • Description: CVE-2013-2094 allows privilege escalation from standard user to root
  • Affected nodes:
    DONE ui
    DONE logstash
    (NO) Storage01, Storage02
    (NO) Cream01, Cream02, Cream03
    DONE SBDII01, SBDII02, SBDII03
    DONE APEL
    (NO) KVM02, KVM03
  • Notes: Machines are not user facing
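
To see whether a node still needs the update, the running kernel can be compared against the first fixed release. The sketch below assumes GNU sort with -V support; the function name and the threshold value are assumptions and should be checked against the vendor advisory for CVE-2013-2094.

```shell
# True (exit status 0) when version $1 is >= version $2 under GNU `sort -V` ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Assumed threshold: take the real value from the vendor advisory.
FIXED_KERNEL="2.6.32-358.6.2.el6"

# Hypothetical usage on a node (arch suffix stripped before comparing):
#   running=$(uname -r | sed 's/\.x86_64$//')
#   version_at_least "$running" "$FIXED_KERNEL" && echo patched || echo "needs update"
```

Note that uname -r reports the kernel currently booted, so a node only shows as patched after it has been rebooted into the new kernel.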

DONE Update CREAM-CE to latest release

  • Description: Update all CREAM-CEs to the latest UMD-2 release.
  • Affected nodes: cream01, cream02, cream03
  • Notes: YAIM also needs to be run

DONE Update ntp servers

DONE Expand dCache monitoring

  • Description: Add monitoring tools to gain better insight into what is happening within dCache
  • Affected nodes: storage01.lcg.cscs.ch
  • Notes: Enable the dCache statistics and install srmwatch

DONE Details for enabling statistics

DONE http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-basic-fhs.shtml

DONE http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-webPage-fhs.shtml

THIS IS FOR DCACHE 1.9.X, NOT INSTALLED

SRM watch

http://www.dcache.org/manuals/Book-1.9.5/config/cf-srm-monitor.shtml

Example running at FNAL: http://cmsdcam3.fnal.gov:8081/srmwatch/

DONE Fix errors found in dCache

  • Description: There is an incorrect path in the LinkGroupAuthorization file setting, and the dCache servers require fetch-crl
  • Affected nodes: storage01.lcg.cscs.ch, storage02.lcg.cscs.ch and all se machines
  • Notes: Whilst troubleshooting dCache issues, some errors have been found.
    The LinkGroupAuthorization.conf is in /etc/dcache, not /opt/d-cache/config/

Jun 27 14:31 [root@nfs02:DCACHE22]# grep opt dcache.* | grep -v port
dcache.conf:# Refer to /usr/share/dcache/defaults/dcache.properties for further options
dcache.conf.pools.sepools3_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools3_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
dcache.conf.pools.sepools4_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools4_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
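
One possible way to correct the setting in the files listed by the grep above is a small in-place rewrite that keeps a backup. This is a sketch under assumptions: the function name is made up, the old path is the one shown in the grep output, and the new path assumes the file sits directly in /etc/dcache as stated in the notes.

```shell
# Point SpaceManagerLinkGroupAuthorizationFileName at /etc/dcache.
# $1 = dCache config file to fix; the original is kept as "$1.bak".
fix_lga_path() {
    cp -p "$1" "$1.bak" &&
    sed 's#=/opt/d-cache/etc/LinkGroupAuthorization\.conf$#=/etc/dcache/LinkGroupAuthorization.conf#' "$1.bak" > "$1"
}

# Hypothetical usage, run in the directory that holds the dcache.conf* files:
#   for f in dcache.conf.pools.sepools3_22 dcache.conf.pools.sepools4_22; do
#       fix_lga_path "$f"
#   done
```

Only lines ending in the old path after an equals sign are changed, so the "# Source:" comments stay untouched.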

Machines need fetch-crl installed and its cron job enabled, as there is currently no vomsdir under /etc/grid-security/
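
After installing fetch-crl, a simple sanity check is to count the CRL files it has downloaded. The sketch below assumes CRLs are stored as *.r0 files in a certificates directory; the function name, the directory, and the install commands in the comments are assumptions about the packaging on these machines.

```shell
# Print the number of CRL files (*.r0) directly inside a directory.
# $1 = certificates directory
count_crls() {
    find "$1" -maxdepth 1 -type f -name '*.r0' | wc -l | tr -d ' '
}

# Hypothetical install and check on one machine (package and service names assumed):
#   yum install fetch-crl
#   chkconfig fetch-crl-cron on
#   service fetch-crl-cron start
#   count_crls /etc/grid-security/certificates
```

A count of 0 after the first scheduled run would suggest that the cron job is not enabled or that the downloads are failing.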

Topic revision: r20 - 2013-07-03 - GeorgeBrown
 