
Scheduled Maintenance on 2013-04-09

On 9 April 2013 we will go into scheduled downtime. The window runs from 8:00 to 18:00, but we will return to operation as soon as the work is finished.

As usual, the CMS and ATLAS queues will be closed 24 hours before the maintenance, and the LHCb queue will be closed 48 hours before it.
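As a rough illustration of the lead times above, the closing time of each queue can be derived from the start of the window. This is only a sketch: it assumes the lead times are counted back from the 8:00 start, and the names `MAINTENANCE_START`, `QUEUE_LEAD_HOURS` and `queue_close_times` are made up for this example, not part of any site tooling.

```python
from datetime import datetime, timedelta

# Start of the downtime window as announced (site local time assumed).
MAINTENANCE_START = datetime(2013, 4, 9, 8, 0)

# Lead times from the announcement; the keys are illustrative labels only.
QUEUE_LEAD_HOURS = {"cms": 24, "atlas": 24, "lhcb": 48}


def queue_close_times(start, lead_hours):
    """Return the closing time of each queue, counted back from the start."""
    return {queue: start - timedelta(hours=hours)
            for queue, hours in lead_hours.items()}


if __name__ == "__main__":
    times = queue_close_times(MAINTENANCE_START, QUEUE_LEAD_HOURS)
    for queue, closes_at in sorted(times.items()):
        print(f"{queue}: closes {closes_at:%Y-%m-%d %H:%M}")
```

Under that assumption, the LHCb queue would close on 2013-04-07 at 8:00 and the CMS and ATLAS queues on 2013-04-08 at 8:00.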

Summary of interventions

We will perform the following operations on the cluster:


DONE Upgrade of Compute Nodes to UMD-2

  • Description: All compute nodes need to be upgraded to UMD-2
  • Affected nodes: wn[01-79]
  • Notes: This is a normal package upgrade on SL5; the OS upgrade will come at a later stage.
Note: Dalco will come and exchange all the power supplies of the compute nodes on that date.

Nodes migrated to UMD 2, cfengine updated.

The files for the UMD-1 repos are empty; this was done in cfengine to remove the possibility of accidentally picking up old repos.

WN74-78 still had installation issues, suspected to be caused by the fake RAID setup. Investigation will continue once the Dalco engineer has replaced the power supplies.
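The bracketed node ranges used on this page (wn[01-79], se[01-14], storage[01,02]) can be expanded into individual host names when a list is needed, for example to loop over nodes during an upgrade. The following is a minimal sketch; `expand_nodes` is a hypothetical helper written for this page and only handles the simple patterns that appear here.

```python
import re

# One prefix followed by a bracket holding numbers or ranges, e.g. wn[01-79].
_PATTERN = re.compile(r"([A-Za-z]+)\[(\d+(?:-\d+)?(?:,\d+(?:-\d+)?)*)\]")


def expand_nodes(pattern):
    """Expand 'wn[01-79]' or 'storage[01,02]' into a list of host names."""
    match = _PATTERN.fullmatch(pattern)
    if match is None:
        return [pattern]  # plain host name such as 'se13'
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        first, _, last = part.partition("-")
        last = last or first   # a single number is a range of one
        width = len(first)     # keep the zero padding, e.g. '01'
        for number in range(int(first), int(last) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


if __name__ == "__main__":
    print(len(expand_nodes("wn[01-79]")), "compute nodes")
    print(expand_nodes("storage[01,02]"))
```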

CLOSED Restart of se13

  • Description: There is a Nagios check that complains about the blocksize of the raids
  • Affected nodes: se13
  • Notes: All parameters are actually OK; a reboot alone is enough to make the check stop complaining.
Restarted the dcs3700_tunning service

CLOSED (CANCELLED) Set MTU=4000 on all ethernet nodes

  • Description: We need to set the MTU to 4000 on all ethernet nodes
  • Affected nodes: All VMs (guests and hosts), both 4036E bridges, and the router.
  • Notes: This affects only public, production IPs. All private interfaces should stay at 1500 for simplicity (we may need to adjust the Nagios check).
    • Virtual guests
    • Virtual hosts
    • Bridges and router DONE
    • Nagios checks
Note: The limit of 4096 (probably 4092) comes from InfiniBand, which supports an MTU of 65k only in Connected Mode; that mode is not available on Ethernet.
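This item was cancelled, but the rule in the notes (4000 on public production interfaces, 1500 on private ones) is easy to express as a check. The sketch below is not the site's actual Nagios check: the function names are hypothetical, the MTU is read from the standard Linux sysfs file, and the exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
from pathlib import Path

PUBLIC_MTU = 4000   # planned value for public, production interfaces
PRIVATE_MTU = 1500  # private interfaces stay at the default


def expected_mtu(is_public):
    """Return the MTU an interface should have under the planned policy."""
    return PUBLIC_MTU if is_public else PRIVATE_MTU


def read_mtu(interface, sysfs_root="/sys/class/net"):
    """Read the current MTU of an interface from Linux sysfs."""
    return int(Path(sysfs_root, interface, "mtu").read_text().strip())


def check_mtu(actual, is_public):
    """Return (exit_code, message) in the style of a Nagios plugin."""
    wanted = expected_mtu(is_public)
    if actual == wanted:
        return 0, f"OK - MTU is {actual}"
    return 2, f"CRITICAL - MTU is {actual}, expected {wanted}"


if __name__ == "__main__":
    # 'eth0' and its public role are assumptions made for this example.
    code, message = check_mtu(read_mtu("eth0"), is_public=True)
    print(message)
    raise SystemExit(code)
```

Keeping the policy in `expected_mtu` separate from the sysfs read means the rule can be adjusted, or tested, without touching a real interface.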

DONE Upgrade to latest CVMFS/squid.

  • Description: Requested from WLCG, before end of April
  • Affected nodes: WNs.
  • Notes: On clients

DONE Upgrade Java on all dCache nodes

  • Description: Java version installed in new pools (and other nodes) is too old. We need to homogenize the version we use.
  • Affected nodes: se[01-14], storage[01-02]
  • Notes: Search for the newest and apply it everywhere.
    • storage01-02 DONE
OpenJDK 1.7 was required; changed the symbolic link /usr/bin/java to point to /etc/alternatives/java.
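To confirm that the link change has been applied on a given node, the link target can be inspected. This is a small sketch using only the Python standard library; `points_to` is a hypothetical helper, and the two paths are the ones mentioned in the note above.

```python
import os


def points_to(link_path, expected_target):
    """True if link_path is a symlink whose direct target is expected_target."""
    return os.path.islink(link_path) and os.readlink(link_path) == expected_target


if __name__ == "__main__":
    ok = points_to("/usr/bin/java", "/etc/alternatives/java")
    print("java link OK" if ok else "java link needs fixing")
```

The check looks only at the first hop of the link, not at whichever JVM the alternatives system finally selects.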

DONE Upgrade dCache on storage[01,02]

  • Description: There is a newer version, and since we're having problems with SRM, it may be a good idea to upgrade it there.
  • Affected nodes: storage[01-02]
  • Notes: This was decided at a late stage

DONE Restart IB Switch

  • Description: swib8 needs to be restarted to pick up the right name
  • Affected nodes: all
  • Notes:

DONE Check IB cables

  • Description: There are some links that create trouble
  • Affected nodes: check
  • Notes: There are a couple of nodes with problems on the cables. Also, change IB card on oss42 (check)
The cables did not seem to be the problem, only Puppet; the network card was replaced.

CLOSED Firmware upgrade on DS3500 controllers

  • Description: Need to upgrade controllers to version 07.83
  • Affected nodes: se[01-08]
  • Notes: Done together with the disk FW, so all IO needs to be stopped (WARNING!). There is a problem with the disk FW and a ticket has been opened with IBM; IO tests are good, so it seems to affect only the FW upgrade itself.
    • Storage-1 DONE
    • Storage-2 DONE
    • Storage-3 DONE
    • Storage-4 DONE
Topic revision: r15 - 2013-04-10 - GeorgeBrown
 