Tags:
create new tag
view all tags

ILOM upgrade problem of X4150 machines 2009-05-27

Upgrade from

Also refer to the last update (FirmwareBiosUpdate)

SP firmware 2.0.2.6
SP firmware build number: 35128
SP firmware date: Mon Jul 28 10:17:35 PDT 2008
SP filesystem version: 0.1.16

upgrade procedure

I follow the general instructions from my earlier FirmwareBiosUpdate:

Upgrade was downloaded from http://www.sun.com/servers/x64/x4150/downloads.jsp

SUN's information on this upgrade: Sun Fire X4150/X4250 Server Tools & Drivers DVD 3.0.0 Image Contents

  • BIOS is updated to version 1ADQW060 (ILOM)
  • SP is updated to Version 3.0.3.30 (ILOM) build r44534
  • Tools and drivers are updated
  • Bug fixes

Copy ILOM update package to tftpd server area on t3admin01:

cp /root/BIOS-FW-update/X4150_X4250_300/remoteflash/ilom/.0.0/ilom.X4150_X4250-3.0.3.30.r44534.pkg /tftpboot/

  • Shutting down the system
  • first reset the ILOM, since the old version had issues with processes dying because of memory shortage
  • log in to the nodes ILOM and issue the following command
 
load -source tftp://192.168.2.21/ilom.X4150_X4250-3.0.3.30.r44534.pkg

initial test on t3wn01. Temporary boot problem

initial testing on t3wn01. OK. ILOM comes back with

Sun(TM) Integrated Lights Out Manager

Version 3.0.3.30

But starting the system somehow failed the first time and in the SP log I see strange stuff

518    Tue May 26 17:52:30 2009  IPMI      Log       critical
       ID =  2e3 : 05/26/2009 : 17:52:30 : System Boot Initiated : BIOS : Initia
       ted by power up
517    Tue May 26 17:52:30 2009  Audit     Log       minor
       KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
        : pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
        value = 65535 : success
516    Tue May 26 17:51:49 2009  IPMI      Log       critical
       ID =  2e2 : 05/26/2009 : 17:51:49 : System Firmware Error : BIOS : CMOS c
       hecksum failure
515    Tue May 26 17:51:03 2009  IPMI      Log       minor
       ID =  2e1 : 05/26/2009 : 17:51:03 : System Firmware Progress : BIOS : Opt
       ion ROM initialization
The second try to start the system worked.

test on t3wn03. temporary boot problem and detailed logs

On t3wn03 I see a similar behavior. The first start after the update failed, and produced the following entries:

269    Wed May 27 08:37:45 2009  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = on : success
268    Wed May 27 08:37:45 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x0 : device po
       wer state = no change : success

The second start produced this (observe the CMOS checksum error! This might be non-critical since the checksum differs due to the new BIOS's default settings?): Show Hide

291    Wed May 27 10:19:22 2009  IPMI      Log       minor
       ID =   b6 : 05/27/2009 : 10:19:22 : System Firmware Progress : BIOS : Sys
       tem boot initiated
290    Wed May 27 10:18:31 2009  IPMI      Log       minor
       ID =   b5 : 05/27/2009 : 10:18:31 : System Firmware Progress : BIOS : Opt
       ion ROM initialization
289    Wed May 27 10:18:19 2009  IPMI      Log       minor
       ID =   b4 : 05/27/2009 : 10:18:19 : System Firmware Progress : BIOS : Vid
       eo initialization
288    Wed May 27 10:18:12 2009  Audit     Log       minor
       KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
        MUX setting : success
287    Wed May 27 10:18:12 2009  Audit     Log       minor
       KCS Command : Set SEL Time : time value = 0x4A1D1364 : success
286    Wed May 27 10:18:10 2009  IPMI      Log       minor
       ID =   b3 : 05/27/2009 : 10:18:10 : System Firmware Progress : BIOS : Sec
       ondary CPU Initialization
285    Wed May 27 10:18:10 2009  IPMI      Log       minor
       ID =   b2 : 05/27/2009 : 10:18:10 : System Firmware Progress : BIOS : Pri
       mary CPU initialization
284    Wed May 27 10:18:09 2009  IPMI      Log       critical
       ID =   b1 : 05/27/2009 : 10:18:09 : System Boot Initiated : BIOS : Initia
       ted by power up
283    Wed May 27 10:18:09 2009  Audit     Log       minor
       KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
        : pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
        value = 65535 : success
282    Wed May 27 10:17:28 2009  IPMI      Log       critical
       ID =   b0 : 05/27/2009 : 10:17:28 : System Firmware Error : BIOS : CMOS c
       hecksum failure
281    Wed May 27 10:16:42 2009  IPMI      Log       minor
       ID =   af : 05/27/2009 : 10:16:42 : System Firmware Progress : BIOS : Opt
       ion ROM initialization
280    Wed May 27 10:16:31 2009  IPMI      Log       minor
       ID =   ae : 05/27/2009 : 10:16:31 : System Firmware Progress : BIOS : Vid
       eo initialization
279    Wed May 27 10:16:23 2009  Audit     Log       minor
       KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
        MUX setting : success
278    Wed May 27 10:16:23 2009  Audit     Log       minor
       KCS Command : Set SEL Time : time value = 0x4A1D12F7 : success
277    Wed May 27 08:41:36 2009  IPMI      Log       minor
       ID =   ad : 05/27/2009 : 08:41:36 : System Firmware Progress : BIOS : Sec
       ondary CPU Initialization
276    Wed May 27 08:41:36 2009  IPMI      Log       minor
       ID =   ac : 05/27/2009 : 08:41:36 : System Firmware Progress : BIOS : Pri
       mary CPU initialization
275    Wed May 27 08:41:35 2009  IPMI      Log       critical
       ID =   ab : 05/27/2009 : 08:41:35 : System Boot Initiated : BIOS : Initia
       ted by power up
274    Wed May 27 08:41:34 2009  Audit     Log       minor
       KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
        : pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
        value = 65535 : success
273    Wed May 27 08:41:04 2009  IPMI      Log       critical
       ID =   aa : 05/27/2009 : 08:41:04 : System ACPI Power State : ACPI : S5/G
       2: soft-off
272    Wed May 27 08:41:04 2009  IPMI      Log       critical
       ID =   a9 : 05/27/2009 : 08:41:04 : System ACPI Power State : ACPI : S0/G
       0: working
271    Wed May 27 08:40:55 2009  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = on : success
270    Wed May 27 08:40:55 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x0 : device po
       wer state = no change : success

I stopped the system by issueing poweroff from the OS. The redirected console froze immediately. Even after a few minutes show /SYS still showed the system as being on. So, the shutting down of the machine seems not to have worked. LEDs are green. There is only one SP log entry:

292    Wed May 27 10:29:46 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x5 : device po
       wer state = no change : success

I did a soft system stop from the ILOM using stop /SYS which had no effect:

294    Wed May 27 10:40:08 2009  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = soft : success
293    Wed May 27 10:40:08 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x5 : device po
       wer state = no change : success

Then I finally used stop -force /SYS which turned the system off.

298    Wed May 27 10:42:29 2009  IPMI      Log       critical
       ID =   b8 : 05/27/2009 : 10:42:29 : System ACPI Power State : ACPI : S5/G
       2: soft-off
297    Wed May 27 10:42:29 2009  IPMI      Log       critical
       ID =   b7 : 05/27/2009 : 10:42:29 : System ACPI Power State : ACPI : S0/G
       0: working
296    Wed May 27 10:42:22 2009  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = off : success
295    Wed May 27 10:42:22 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x7 : device po
       wer state = no change : success

Again doing start /SYS brought the machine up: Show Hide

311    Wed May 27 10:46:17 2009  IPMI      Log       minor
       ID =   c0 : 05/27/2009 : 10:46:17 : System Firmware Progress : BIOS : Sys
       tem boot initiated
310    Wed May 27 10:45:26 2009  IPMI      Log       minor
       ID =   bf : 05/27/2009 : 10:45:26 : System Firmware Progress : BIOS : Opt
       ion ROM initialization
309    Wed May 27 10:45:14 2009  IPMI      Log       minor
       ID =   be : 05/27/2009 : 10:45:14 : System Firmware Progress : BIOS : Vid
       eo initialization
308    Wed May 27 10:45:06 2009  Audit     Log       minor
       KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
        MUX setting : success
307    Wed May 27 10:45:06 2009  Audit     Log       minor
       KCS Command : Set SEL Time : time value = 0x4A1D19B2 : success
306    Wed May 27 10:45:05 2009  IPMI      Log       minor
       ID =   bd : 05/27/2009 : 10:45:05 : System Firmware Progress : BIOS : Sec
       ondary CPU Initialization
305    Wed May 27 10:45:05 2009  IPMI      Log       minor
       ID =   bc : 05/27/2009 : 10:45:05 : System Firmware Progress : BIOS : Pri
       mary CPU initialization
304    Wed May 27 10:45:03 2009  IPMI      Log       critical
       ID =   bb : 05/27/2009 : 10:45:03 : System Boot Initiated : BIOS : Initia
       ted by power up
303    Wed May 27 10:45:03 2009  Audit     Log       minor
       KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
        : pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
        value = 65535 : success
302    Wed May 27 10:44:35 2009  IPMI      Log       critical
       ID =   ba : 05/27/2009 : 10:44:35 : System ACPI Power State : ACPI : S5/G
       2: soft-off
301    Wed May 27 10:44:35 2009  IPMI      Log       critical
       ID =   b9 : 05/27/2009 : 10:44:35 : System ACPI Power State : ACPI : S0/G
       0: working
300    Wed May 27 10:44:23 2009  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = on : success
299    Wed May 27 10:44:23 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x0 : device po
       wer state = no change : success

Issueing a reboot from the OS seems to correctly have done a cycle. Immediately after issueing the command, I lose the console redirection (non-responsive), and I need to start /SP/console again to see the boot sequence. Show Hide

320    Wed May 27 11:00:27 2009  IPMI      Log       minor
       ID =   c6 : 05/27/2009 : 11:00:27 : System Firmware Progress : BIOS : Sys
       tem boot initiated
319    Wed May 27 10:59:37 2009  IPMI      Log       minor
       ID =   c5 : 05/27/2009 : 10:59:37 : System Firmware Progress : BIOS : Opt
       ion ROM initialization
318    Wed May 27 10:59:25 2009  IPMI      Log       minor
       ID =   c4 : 05/27/2009 : 10:59:25 : System Firmware Progress : BIOS : Vid
       eo initialization
317    Wed May 27 10:59:18 2009  Audit     Log       minor
       KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
        MUX setting : success
316    Wed May 27 10:59:18 2009  Audit     Log       minor
       KCS Command : Set SEL Time : time value = 0x4A1D1D06 : success
315    Wed May 27 10:59:15 2009  IPMI      Log       minor
       ID =   c3 : 05/27/2009 : 10:59:15 : System Firmware Progress : BIOS : Sec
       ondary CPU Initialization
314    Wed May 27 10:59:15 2009  IPMI      Log       minor
       ID =   c2 : 05/27/2009 : 10:59:15 : System Firmware Progress : BIOS : Pri
       mary CPU initialization
313    Wed May 27 10:59:13 2009  IPMI      Log       critical
       ID =   c1 : 05/27/2009 : 10:59:13 : System Boot Initiated : BIOS : Initia
       ted by power up
312    Wed May 27 10:59:13 2009  Audit     Log       minor
       KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
        : pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
        value = 65535 : success

Doing another poweroff from the OS correctly brought the machine down, and this time show /SYS also showed it as off.

323    Wed May 27 11:08:44 2009  IPMI      Log       critical
       ID =   c8 : 05/27/2009 : 11:08:44 : System ACPI Power State : ACPI : S5/G
       2: soft-off
322    Wed May 27 11:08:44 2009  IPMI      Log       critical
       ID =   c7 : 05/27/2009 : 11:08:44 : System ACPI Power State : ACPI : S0/G
       0: working
321    Wed May 27 11:08:43 2009  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = 0x5 : device po
       wer state = no change : success

t3se01 - CLI misses all /SYS entries

Firmware update is complete.
ILOM will now be restarted with the new firmware.

-> /sbin/reboot
Read from remote host rmse01: Connection reset by peer
Connection to rmse01 closed.

However, when the ILOM came up, there was no /SYS entry, but only /SP and /HOST

-> show

 /
    Targets:
        HOST
        SP

In the /SP/logs I find lots of I2C bus(0) seems to be faulty errors: Show Hide

265    Mon Jun  1 18:15:17 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
264    Mon Jun  1 18:14:48 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
263    Mon Jun  1 18:14:47 2009  Audit     Log       minor
       root : Close Session : object = /session/type : value = shell : success
262    Mon Jun  1 18:14:47 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
261    Mon Jun  1 18:14:22 2009  IPMI      Log       critical
       ID =   91 : 06/01/2009 : 18:14:22 : System ACPI Power State : ACPI : S5/G
       2: soft-off
260    Mon Jun  1 18:18:27 2009  System    Log       critical
       SP is about to reboot
259    Mon Jun  1 18:18:26 2009  System    Log       critical
       upgrade to version 3.0.3.30 succeeded
258    Mon Jun  1 18:13:38 2009  IPMI      Log       critical
       ID =   90 : 06/01/2009 : 18:13:38 : System ACPI Power State : ACPI : S5/G
       2: soft-off
257    Mon Jun  1 18:11:50 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
256    Sat May 30 13:04:04 2009  Audit     Log       minor
       root : Close Session : object = /session/type : value = shell : success
255    Sat May 30 13:03:51 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
254    Sat May 30 13:03:41 2009  IPMI      Log       critical
       ID =   8f : 05/30/2009 : 13:03:41 : System ACPI Power State : ACPI : S0/G
       0: working
253    Sat May 30 13:03:19 2009  IPMI      Log       critical
       ID =   8e : 05/30/2009 : 13:03:19 : System ACPI Power State : ACPI : S5/G
       2: soft-off
252    Sat May 30 13:01:25 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
251    Fri May  8 09:37:47 2009  Audit     Log       minor
       root : Close Session : object = /session/type : value = shell : success
250    Fri May  8 09:34:15 2009  Audit     Log       minor
       root : Open Session : object = /session/type : value = shell : success
249    Thu May  7 18:39:35 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
248    Thu May  7 18:37:44 2009  System    Log       critical
       I2C bus(0) seems to be faulty
247    Thu May  7 18:35:28 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
246    Thu May  7 18:33:15 2009  System    Log       critical
       I2C bus(0) seems to be faulty
245    Thu May  7 18:31:46 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
244    Thu May  7 18:29:34 2009  System    Log       critical
       I2C bus(0) seems to be faulty
243    Thu May  7 18:29:11 2009  IPMI      Log       critical
       ID =   8d : 05/07/2009 : 18:29:11 : System ACPI Power State : ACPI : S0/G
       0: working
242    Thu May  7 18:27:07 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
241    Thu May  7 18:25:16 2009  System    Log       critical
       I2C bus(0) seems to be faulty
240    Thu May  7 18:24:14 2009  IPMI      Log       critical
       ID =   8c : 05/07/2009 : 18:24:13 : System ACPI Power State : ACPI : S5/G
       2: soft-off
239    Thu May  7 18:23:04 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
238    Thu May  7 18:21:13 2009  System    Log       critical
       I2C bus(0) seems to be faulty
237    Thu May  7 18:19:45 2009  System    Log       critical
       I2C bus(0) seems to have recovered from the faulty state
236    Thu May  7 18:17:32 2009  System    Log       critical
       I2C bus(0) seems to be faulty
..
...
223    Thu May  7 17:59:00 2009  System    Log       critical
       I2C bus(0) seems to be faulty
222    Thu May  7 17:57:46 2009  IPMI      Log       critical
       ID =   8a : 05/07/2009 : 17:57:46 : System ACPI Power State : ACPI : S5/G
       2: soft-off
221    Thu May  7 17:56:39 2009  IPMI      Log       critical
       ID =   89 : 05/07/2009 : 17:56:39 : System ACPI Power State : ACPI : S5/G
       2: soft-off
220    Thu May  7 17:55:06 2009  IPMI      Log       critical
       ID =   88 : 05/07/2009 : 17:55:06 : System ACPI Power State : ACPI : S5/G
       2: soft-off
219    Thu May  7 17:53:53 2009  IPMI      Log       critical
       ID =   87 : 05/07/2009 : 17:53:53 : System ACPI Power State : ACPI : S5/G
       2: soft-off
218    Thu May  7 17:51:36 2009  IPMI      Log       critical
       ID =   86 : 05/07/2009 : 17:51:36 : System ACPI Power State : ACPI : S0/G
       0: working
217    Thu May  7 17:51:31 2009  IPMI      Log       critical
       ID =   85 : 05/07/2009 : 17:51:30 : System ACPI Power State : ACPI : S5/G
       2: soft-off

The /SP cannot be reset any more via the CLI:

> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP
Performing reset on /SP failed
reset: No such object

The HTTP interface still seems to give full access to the /SYS. I was able to start it this way, and I also was able to reset the /SP via the HTTP interface. But even after the reset, the CLI comes up without showing the /SYS tree.

I logged in to the OS and did a poweroff from there. However, judging from the HTTP interface, the node did not come down and I did an "immediate poweroff" from there.

I initiated another SP firmware upgrade with the same firmware from the HTTP interface. The interface showed that the upgrade had correctly worked and the SP went into another reboot.

Still. The CLI came up again without showing the /SYS tree.

I started the OS again through the HTTP interface. /var/log/messages showed that the system seems to have shut down cleanly the last time.

Repaired by doing an ILOM reinstall with deleting configuration

After doing another Firmware reinstall with selecting "set all configuration to defaults" the ILOM seems to behave normally. All /SYS entries are back.

t3ui01

Did the firmware upgrade on June 1st. On the following morning, these service LEDs were turned on (I do not know whether it had already been on before the upgrade).

  • /SYS/MB/MCH/DA0/SERVICE
  • /SYS/SERVICE

I shut down the system from the OS using "poweroff". The ILOM did not show the system as "off" even after several minutes.

I disconnected the power for a minute, then reconnected. The service LEDs were all off now.

After restarting the system, I saw that the LED was again turned on:

/SYS/MB/MCH/DA0/SERVICE

This points to a DIMM problem.

Looking at the OS /var/log/messages, I find indeed some DIMM related messages beginning at May 24 2009:

May 24 17:54:17 t3ui01 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
May 24 17:54:17 t3ui01 kernel: EDAC MC0: CE row 0, channel 0, label "": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=811
4 CAS=2852, CE Err=0x2000)

The ILOM event list shows the following entries

  bd | 06/01/2009 | 18:56:18 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0
  c6 | 06/01/2009 | 20:16:42 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0
  d2 | 06/02/2009 | 12:21:47 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0

ILOM CLI passwd DB corrupted on t3wn02 and t3wn07

Cannot log in to ILOM management on rmwn02 and rmwn07 any more. The default password is no longer accepted (but it had never been changed). Logging in to the sunservice account also does not work any more.

I am convinced that this is a result from the old ILOM version's problem that I had discovered a while ago (SunHwContact#2008_01_30_Machines_lose_IPMI_co), where a runaway process on the SP ate up all available memory and caused other ILOM processes (like the IPMI service) to die.

The nodes were powered down from the OS, and then disconnected for a prolonged time from the power. Did not help.

Luckily, the HTTP interface still was accessible with the correct password. From there I did a "set all configuration to defaults (including users)" (in Maintenance/Configuration Management). After a reboot the ILOM had a new SSH key, but I was able to log in again to the CLI with the default password.

-- DerekFeichtinger - 27 May 2009

Edit | Attach | Watch | Print version | History: r7 < r6 < r5 < r4 < r3 | Backlinks | Raw View | Raw edit | More topic actions
Topic revision: r7 - 2009-06-19 - DerekFeichtinger
 
This site is powered by the TWiki collaboration platform Powered by Perl This site is powered by the TWiki collaboration platformCopyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback