Scheduled Maintenance on 2013-07-03

On the first working Wednesday of the month (2013-07-03) we will go into Scheduled Downtime. It is scheduled to last from 9:00 to 18:00, but we will return to operation as soon as the work is finished.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue 48 hours before.

REMEMBER TO ADD THE DOWNTIME IN GOCDB. Queues will be closed according to the following schedule:

Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlas" | at -m 9am 2.07.13
job 100 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable atlashimem" | at -m 9am 2.07.13
job 101 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable cms" | at -m 9am 2.07.13
job 102 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable other" | at -m 9am 2.07.13
job 103 at 2013-07-02 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lhcb" | at -m 9am 1.07.13
job 104 at 2013-07-01 09:00
Jun 14 12:07 [root@lrms02:~]# echo "qdisable lcgadmin" | at -m 8:30am 3.07.13
job 105 at 2013-07-03 08:30
Jun 14 12:07 [root@lrms02:~]# atq
100   2013-07-02 09:00 a root
102   2013-07-02 09:00 a root
105   2013-07-03 08:30 a root
103   2013-07-02 09:00 a root
101   2013-07-02 09:00 a root
104   2013-07-01 09:00 a root
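
Should the downtime be postponed, the pending closures can be cancelled with atrm; after the maintenance the queues have to be re-opened by hand. A minimal sketch (the qenable loop is an assumption, not part of the recorded schedule):

atrm 100 101 102 103 104 105         # cancel the scheduled closures if the downtime moves
for q in atlas atlashimem cms other lhcb lcgadmin; do
    qenable "$q"                     # re-open each queue once the maintenance is over
done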

Summary of interventions

We will perform the following operations on the cluster:


DONE Restrict squid

  • Description: Restrict squid so that only the CVMFS Stratum-1 servers listed below can be accessed through the squid proxy
  • Affected nodes: cvmfs1, cvmfs, wn[01-78]
  • Notes: Add the following ACLs to squid.conf:
    # RAL and BNL replicas
    acl ral dst cernvmfs.gridpp.rl.ac.uk
    acl ral dst cvmfs.racf.bnl.gov
    
    # Full set of CVMFS Stratum-1 servers
    acl cvmfs dst cvmfs-stratum-one.cern.ch
    acl cvmfs dst cernvmfs.gridpp.rl.ac.uk
    acl cvmfs dst cvmfs.racf.bnl.gov
    acl cvmfs dst cvmfs02.grid.sinica.edu.tw
    acl cvmfs dst cvmfs.fnal.gov
    acl cvmfs dst cvmfs-atlas-nightlies.cern.ch
    
    Then update the http_access rule for localnet so that only these destinations can be reached:
    http_access allow localnet ral
    http_access allow localnet cvmfs
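
After reloading squid, the restriction can be spot-checked from a worker node. A minimal sketch, assuming the proxy listens on squid's default port 3128 (the port is an assumption, not recorded above):

squid -k reconfigure                                                           # reload the new ACLs on the proxy
curl -sI -x http://cvmfs1:3128 http://cvmfs-stratum-one.cern.ch/ | head -1     # allowed destination
curl -sI -x http://cvmfs1:3128 http://www.example.com/ | head -1               # should now return 403 Forbidden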
    

Update worker nodes to SL6

  • Description: Worker nodes will be updated to SL6
  • Affected nodes: All worker nodes
  • Notes: With this update we will be able to use the OFED stack bundled with SL6 and drop Mellanox OFED from the install process. The install process itself will also be refined: internal repos will be used, and reboots during provisioning will be kept to a minimum. mcelog will also be installed to monitor for memory errors (see the sketch below).
The BDII services will then be restarted to ensure we are publishing the correct information.
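
A minimal sketch of the mcelog step on an SL6 node (package and service names as in the stock SL6 distribution; adjust if the local build differs):

yum -y install mcelog    # userspace decoder for machine-check exceptions
chkconfig mcelogd on     # start the daemon at boot
service mcelogd start    # start it now; decoded memory errors go to /var/log/mcelog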

Update cvmfs

  • Description: In SL6, cvmfs needs to be updated to 2.1
  • Affected nodes: All worker nodes
  • Notes: We also have to mount cvmfs in RW mode; consult web-rt ticket #13573.
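
A minimal sketch of the client update on a worker node, assuming the packages come from the standard CVMFS yum repository (the RW-mount detail from the ticket is not covered here):

yum -y update cvmfs cvmfs-keys   # pull in the 2.1 client
cvmfs_config setup               # regenerate autofs/fuse configuration for 2.1
cvmfs_config probe               # verify that the repositories mount correctly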

Restart pbs and dcache services

  • Description: After the recent DNS change we need to restart the services that still query the old systems.
  • Affected nodes: se[01-14], storage0[1,2] and lrms0[1-2]
  • Notes: Check ticket #13546
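
A minimal sketch of the restarts (init script names are assumptions based on the stock Torque and dCache packages):

service pbs_server restart   # on lrms0[1-2]
dcache restart               # on se[01-14] and storage0[1,2]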

DONE Decommission KVM01

  • Description: Remaining VMs are to be moved from this host so that KVM01 can be decommissioned
  • Affected nodes: Pub, UI64, ppcvmfs
  • Notes:
    1. DONE pub is still at 5.4; reinstall with 6.4.
    2. DONE ui has been installed on KVM03; this will replace ui64.
    3. DONE ppcvmfs to be moved to the pre-production KVM host.

DONE Decommission old voboxes

  • Description: Old voboxes need to be decommissioned.
  • Affected nodes: cmsvobox and atlasvobox
  • Notes: atlasvobox can be shut down, but NOT cmsvobox (it has been moved to a KVM VM until CMS is ready). The atlasvobox VM disks have been moved to /kvm02/.

DONE Migrate lrms02 to kvm

  • Description: Right now lrms02 is still a Xen VM that needs to be migrated to KVM.
  • Affected nodes: lrms02
  • Notes: Check the process followed in the previous maintenance.

Update kernels of SL6 machines

  • Description: CVE-2013-2094 allows privilege escalation from a standard user to root
  • Affected nodes:
    DONE ui
    DONE logstash
    (NO) storage01, storage02
    (NO) cream01, cream02, cream03
    sbdii01, sbdii02, sbdii03
    apel
    (NO) kvm02, kvm03
  • Notes: The machines marked (NO) are not user facing, so this local privilege escalation is less of a concern there and their update is deferred (see the sketch below for the update itself).
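
The update itself is the usual kernel refresh; a minimal sketch per machine (the fix only takes effect after rebooting into the new kernel):

yum -y update kernel   # pull in the patched SL6 kernel
reboot                 # required: the running kernel is only replaced at boot
# after the reboot:
uname -r               # confirm the patched kernel version is running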

Update CREAM-CE to latest release

  • Description: Update all CREAM-CEs to the latest UMD-2 release.
  • Affected nodes: cream01, DONE cream02, DONE cream03
  • Notes: YAIM also needs to be run after the update (see the sketch below).
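
A minimal sketch, assuming the standard UMD-2 layout; the site-info.def path is site-specific:

yum -y update                                                                  # pull in the latest UMD-2 packages
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n creamCE    # reconfigure the CE with YAIM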

DONE Update ntp servers

Expand dCache monitoring

  • Description: Add monitoring tools to gain better visibility into what is happening inside dCache
  • Affected nodes: storage01.lcg.cscs.ch
  • Notes: Enable the dCache statistics module and install srmwatch

Details for enabling statistics

DONE http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-basic-fhs.shtml

DONE http://www.dcache.org/manuals/Book-2.2/config/cf-statistics-webPage-fhs.shtml
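
Per the Book pages above, the statistics service is enabled by adding a statistics cell to the layout file on storage01 and restarting dCache. A minimal sketch (the domain name is our choice and the layout file name is site-specific):

# In /etc/dcache/layouts/<layout>.conf add:
[statisticsDomain]
[statisticsDomain/statistics]

# then restart dCache so the new domain starts
dcache restart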

SRM watch

http://www.dcache.org/manuals/Book-1.9.5/config/cf-srm-monitor.shtml

Example running at FNAL: http://cmsdcam3.fnal.gov:8081/srmwatch/

DONE Fix errors found in dCache

  • Description: There is an incorrect path in the LinkGroupAuthorization file, and the dCache servers require fetch-crl
  • Affected nodes: storage01.lcg.cscs.ch, storage02.lcg.cscs.ch and all se machines
  • Notes: Whilst troubleshooting dCache issues, some errors were found.
The LinkGroupAuthorization.conf file is in /etc/dcache, not in /opt/d-cache/config/, yet the pool configurations still point at the old location:

Jun 27 14:31 [root@nfs02:DCACHE22]# grep opt dcache.* | grep -v port
dcache.conf:# Refer to /usr/share/dcache/defaults/dcache.properties for further options
dcache.conf.pools.sepools3_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools3_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf
dcache.conf.pools.sepools4_22:# Source: /opt/d-cache//config/dCacheSetup
dcache.conf.pools.sepools4_22:SpaceManagerLinkGroupAuthorizationFileName=/opt/d-cache/etc/LinkGroupAuthorization.conf

Machines need fetch-crl installed and its cron job enabled, as there is currently no vomsdir under /etc/grid-security/.
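
A minimal sketch of the fetch-crl fix on the affected nodes (package and service names as shipped in EPEL):

yum -y install fetch-crl     # CRL updater for the grid CA certificates
chkconfig fetch-crl-cron on  # enable the periodic cron job at boot
service fetch-crl-cron start # start it now
fetch-crl                    # one-off run to populate the CRLs immediately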
