ILOM upgrade problem of X4150 machines 2009-05-27
Upgrade from
Also refer to the last update (
FirmwareBiosUpdate)
SP firmware 2.0.2.6
SP firmware build number: 35128
SP firmware date: Mon Jul 28 10:17:35 PDT 2008
SP filesystem version: 0.1.16
upgrade procedure
I follow the general instructions from my earlier
FirmwareBiosUpdate:
Upgrade was downloaded from
http://www.sun.com/servers/x64/x4150/downloads.jsp
SUN's information on this upgrade:
Sun Fire X4150/X4250 Server Tools & Drivers DVD 3.0.0 Image Contents
- BIOS is updated to version 1ADQW060 (ILOM)
- SP is updated to Version 3.0.3.30 (ILOM) build r44534
- Tools and drivers are updated
- Bug fixes
Copy ILOM update package to tftpd server area on t3admin01:
cp /root/BIOS-FW-update/X4150_X4250_300/remoteflash/ilom/.0.0/ilom.X4150_X4250-3.0.3.30.r44534.pkg /tftpboot/
- Shutting down the system
- first reset the ILOM, since the old version had issues with processes dying because of memory shortage
- log in to the nodes ILOM and issue the following command
load -source tftp://192.168.2.21/ilom.X4150_X4250-3.0.3.30.r44534.pkg
initial test on t3wn01. Temporary boot problem
initial testing on t3wn01. OK. ILOM comes back with
Sun(TM) Integrated Lights Out Manager
Version 3.0.3.30
But starting the system somehow failed the first time and in the SP log I see strange stuff
518 Tue May 26 17:52:30 2009 IPMI Log critical
ID = 2e3 : 05/26/2009 : 17:52:30 : System Boot Initiated : BIOS : Initia
ted by power up
517 Tue May 26 17:52:30 2009 Audit Log minor
KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
: pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
value = 65535 : success
516 Tue May 26 17:51:49 2009 IPMI Log critical
ID = 2e2 : 05/26/2009 : 17:51:49 : System Firmware Error : BIOS : CMOS c
hecksum failure
515 Tue May 26 17:51:03 2009 IPMI Log minor
ID = 2e1 : 05/26/2009 : 17:51:03 : System Firmware Progress : BIOS : Opt
ion ROM initialization
The second try to start the system worked.
test on t3wn03. temporary boot problem and detailed logs
On t3wn03 I see a similar behavior.
The first start after the update failed, and produced the following entries:
269 Wed May 27 08:37:45 2009 Audit Log minor
root : Set : object = /SYS/power_state : value = on : success
268 Wed May 27 08:37:45 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x0 : device po
wer state = no change : success
The second start produced this (
observe the CMOS checksum error! This might be non-critical since the checksum differs due to the new BIOS's default settings?):
Show Hide
291 Wed May 27 10:19:22 2009 IPMI Log minor
ID = b6 : 05/27/2009 : 10:19:22 : System Firmware Progress : BIOS : Sys
tem boot initiated
290 Wed May 27 10:18:31 2009 IPMI Log minor
ID = b5 : 05/27/2009 : 10:18:31 : System Firmware Progress : BIOS : Opt
ion ROM initialization
289 Wed May 27 10:18:19 2009 IPMI Log minor
ID = b4 : 05/27/2009 : 10:18:19 : System Firmware Progress : BIOS : Vid
eo initialization
288 Wed May 27 10:18:12 2009 Audit Log minor
KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
MUX setting : success
287 Wed May 27 10:18:12 2009 Audit Log minor
KCS Command : Set SEL Time : time value = 0x4A1D1364 : success
286 Wed May 27 10:18:10 2009 IPMI Log minor
ID = b3 : 05/27/2009 : 10:18:10 : System Firmware Progress : BIOS : Sec
ondary CPU Initialization
285 Wed May 27 10:18:10 2009 IPMI Log minor
ID = b2 : 05/27/2009 : 10:18:10 : System Firmware Progress : BIOS : Pri
mary CPU initialization
284 Wed May 27 10:18:09 2009 IPMI Log critical
ID = b1 : 05/27/2009 : 10:18:09 : System Boot Initiated : BIOS : Initia
ted by power up
283 Wed May 27 10:18:09 2009 Audit Log minor
KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
: pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
value = 65535 : success
282 Wed May 27 10:17:28 2009 IPMI Log critical
ID = b0 : 05/27/2009 : 10:17:28 : System Firmware Error : BIOS : CMOS c
hecksum failure
281 Wed May 27 10:16:42 2009 IPMI Log minor
ID = af : 05/27/2009 : 10:16:42 : System Firmware Progress : BIOS : Opt
ion ROM initialization
280 Wed May 27 10:16:31 2009 IPMI Log minor
ID = ae : 05/27/2009 : 10:16:31 : System Firmware Progress : BIOS : Vid
eo initialization
279 Wed May 27 10:16:23 2009 Audit Log minor
KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
MUX setting : success
278 Wed May 27 10:16:23 2009 Audit Log minor
KCS Command : Set SEL Time : time value = 0x4A1D12F7 : success
277 Wed May 27 08:41:36 2009 IPMI Log minor
ID = ad : 05/27/2009 : 08:41:36 : System Firmware Progress : BIOS : Sec
ondary CPU Initialization
276 Wed May 27 08:41:36 2009 IPMI Log minor
ID = ac : 05/27/2009 : 08:41:36 : System Firmware Progress : BIOS : Pri
mary CPU initialization
275 Wed May 27 08:41:35 2009 IPMI Log critical
ID = ab : 05/27/2009 : 08:41:35 : System Boot Initiated : BIOS : Initia
ted by power up
274 Wed May 27 08:41:34 2009 Audit Log minor
KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
: pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
value = 65535 : success
273 Wed May 27 08:41:04 2009 IPMI Log critical
ID = aa : 05/27/2009 : 08:41:04 : System ACPI Power State : ACPI : S5/G
2: soft-off
272 Wed May 27 08:41:04 2009 IPMI Log critical
ID = a9 : 05/27/2009 : 08:41:04 : System ACPI Power State : ACPI : S0/G
0: working
271 Wed May 27 08:40:55 2009 Audit Log minor
root : Set : object = /SYS/power_state : value = on : success
270 Wed May 27 08:40:55 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x0 : device po
wer state = no change : success
I stopped the system by issueing
poweroff
from the OS. The redirected console froze immediately.
Even after a few minutes show /SYS
still showed the system as being on
. So, the shutting down of the machine seems not to have worked. LEDs are green. There is only one SP log entry:
292 Wed May 27 10:29:46 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x5 : device po
wer state = no change : success
I did a soft system stop from the ILOM using
stop /SYS
which had no effect:
294 Wed May 27 10:40:08 2009 Audit Log minor
root : Set : object = /SYS/power_state : value = soft : success
293 Wed May 27 10:40:08 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x5 : device po
wer state = no change : success
Then I finally used
stop -force /SYS
which turned the system off.
298 Wed May 27 10:42:29 2009 IPMI Log critical
ID = b8 : 05/27/2009 : 10:42:29 : System ACPI Power State : ACPI : S5/G
2: soft-off
297 Wed May 27 10:42:29 2009 IPMI Log critical
ID = b7 : 05/27/2009 : 10:42:29 : System ACPI Power State : ACPI : S0/G
0: working
296 Wed May 27 10:42:22 2009 Audit Log minor
root : Set : object = /SYS/power_state : value = off : success
295 Wed May 27 10:42:22 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x7 : device po
wer state = no change : success
Again doing
start /SYS
brought the machine up:
Show Hide
311 Wed May 27 10:46:17 2009 IPMI Log minor
ID = c0 : 05/27/2009 : 10:46:17 : System Firmware Progress : BIOS : Sys
tem boot initiated
310 Wed May 27 10:45:26 2009 IPMI Log minor
ID = bf : 05/27/2009 : 10:45:26 : System Firmware Progress : BIOS : Opt
ion ROM initialization
309 Wed May 27 10:45:14 2009 IPMI Log minor
ID = be : 05/27/2009 : 10:45:14 : System Firmware Progress : BIOS : Vid
eo initialization
308 Wed May 27 10:45:06 2009 Audit Log minor
KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
MUX setting : success
307 Wed May 27 10:45:06 2009 Audit Log minor
KCS Command : Set SEL Time : time value = 0x4A1D19B2 : success
306 Wed May 27 10:45:05 2009 IPMI Log minor
ID = bd : 05/27/2009 : 10:45:05 : System Firmware Progress : BIOS : Sec
ondary CPU Initialization
305 Wed May 27 10:45:05 2009 IPMI Log minor
ID = bc : 05/27/2009 : 10:45:05 : System Firmware Progress : BIOS : Pri
mary CPU initialization
304 Wed May 27 10:45:03 2009 IPMI Log critical
ID = bb : 05/27/2009 : 10:45:03 : System Boot Initiated : BIOS : Initia
ted by power up
303 Wed May 27 10:45:03 2009 Audit Log minor
KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
: pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
value = 65535 : success
302 Wed May 27 10:44:35 2009 IPMI Log critical
ID = ba : 05/27/2009 : 10:44:35 : System ACPI Power State : ACPI : S5/G
2: soft-off
301 Wed May 27 10:44:35 2009 IPMI Log critical
ID = b9 : 05/27/2009 : 10:44:35 : System ACPI Power State : ACPI : S0/G
0: working
300 Wed May 27 10:44:23 2009 Audit Log minor
root : Set : object = /SYS/power_state : value = on : success
299 Wed May 27 10:44:23 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x0 : device po
wer state = no change : success
Issueing a
reboot
from the OS seems to correctly have done a cycle. Immediately after issueing the command, I lose the console redirection (non-responsive), and I need to
start /SP/console
again to see the boot sequence.
Show Hide
320 Wed May 27 11:00:27 2009 IPMI Log minor
ID = c6 : 05/27/2009 : 11:00:27 : System Firmware Progress : BIOS : Sys
tem boot initiated
319 Wed May 27 10:59:37 2009 IPMI Log minor
ID = c5 : 05/27/2009 : 10:59:37 : System Firmware Progress : BIOS : Opt
ion ROM initialization
318 Wed May 27 10:59:25 2009 IPMI Log minor
ID = c4 : 05/27/2009 : 10:59:25 : System Firmware Progress : BIOS : Vid
eo initialization
317 Wed May 27 10:59:18 2009 Audit Log minor
KCS Command : Set Serial/Modem Mux : channel number = 0 : MUX setting get
MUX setting : success
316 Wed May 27 10:59:18 2009 Audit Log minor
KCS Command : Set SEL Time : time value = 0x4A1D1D06 : success
315 Wed May 27 10:59:15 2009 IPMI Log minor
ID = c3 : 05/27/2009 : 10:59:15 : System Firmware Progress : BIOS : Sec
ondary CPU Initialization
314 Wed May 27 10:59:15 2009 IPMI Log minor
ID = c2 : 05/27/2009 : 10:59:15 : System Firmware Progress : BIOS : Pri
mary CPU initialization
313 Wed May 27 10:59:13 2009 IPMI Log critical
ID = c1 : 05/27/2009 : 10:59:13 : System Boot Initiated : BIOS : Initia
ted by power up
312 Wed May 27 10:59:13 2009 Audit Log minor
KCS Command : Set Watchdog Timer : timer user = 0x2 : timer actions = 0x0
: pre-timeout interval = 0 : expiration flags = 0x0 : initial count down
value = 65535 : success
Doing another
poweroff
from the OS correctly brought the machine down, and this time
show /SYS
also showed it as off.
323 Wed May 27 11:08:44 2009 IPMI Log critical
ID = c8 : 05/27/2009 : 11:08:44 : System ACPI Power State : ACPI : S5/G
2: soft-off
322 Wed May 27 11:08:44 2009 IPMI Log critical
ID = c7 : 05/27/2009 : 11:08:44 : System ACPI Power State : ACPI : S0/G
0: working
321 Wed May 27 11:08:43 2009 Audit Log minor
KCS Command : Set ACPI Power State : system power state = 0x5 : device po
wer state = no change : success
t3se01 - CLI misses all /SYS entries
Firmware update is complete.
ILOM will now be restarted with the new firmware.
-> /sbin/reboot
Read from remote host rmse01: Connection reset by peer
Connection to rmse01 closed.
However, when the ILOM came up, there was no /SYS entry, but only /SP and /HOST
-> show
/
Targets:
HOST
SP
In the /SP/logs I find lots of
I2C bus(0) seems to be faulty
errors:
Show Hide
265 Mon Jun 1 18:15:17 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
264 Mon Jun 1 18:14:48 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
263 Mon Jun 1 18:14:47 2009 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
262 Mon Jun 1 18:14:47 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
261 Mon Jun 1 18:14:22 2009 IPMI Log critical
ID = 91 : 06/01/2009 : 18:14:22 : System ACPI Power State : ACPI : S5/G
2: soft-off
260 Mon Jun 1 18:18:27 2009 System Log critical
SP is about to reboot
259 Mon Jun 1 18:18:26 2009 System Log critical
upgrade to version 3.0.3.30 succeeded
258 Mon Jun 1 18:13:38 2009 IPMI Log critical
ID = 90 : 06/01/2009 : 18:13:38 : System ACPI Power State : ACPI : S5/G
2: soft-off
257 Mon Jun 1 18:11:50 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
256 Sat May 30 13:04:04 2009 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
255 Sat May 30 13:03:51 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
254 Sat May 30 13:03:41 2009 IPMI Log critical
ID = 8f : 05/30/2009 : 13:03:41 : System ACPI Power State : ACPI : S0/G
0: working
253 Sat May 30 13:03:19 2009 IPMI Log critical
ID = 8e : 05/30/2009 : 13:03:19 : System ACPI Power State : ACPI : S5/G
2: soft-off
252 Sat May 30 13:01:25 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
251 Fri May 8 09:37:47 2009 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
250 Fri May 8 09:34:15 2009 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
249 Thu May 7 18:39:35 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
248 Thu May 7 18:37:44 2009 System Log critical
I2C bus(0) seems to be faulty
247 Thu May 7 18:35:28 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
246 Thu May 7 18:33:15 2009 System Log critical
I2C bus(0) seems to be faulty
245 Thu May 7 18:31:46 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
244 Thu May 7 18:29:34 2009 System Log critical
I2C bus(0) seems to be faulty
243 Thu May 7 18:29:11 2009 IPMI Log critical
ID = 8d : 05/07/2009 : 18:29:11 : System ACPI Power State : ACPI : S0/G
0: working
242 Thu May 7 18:27:07 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
241 Thu May 7 18:25:16 2009 System Log critical
I2C bus(0) seems to be faulty
240 Thu May 7 18:24:14 2009 IPMI Log critical
ID = 8c : 05/07/2009 : 18:24:13 : System ACPI Power State : ACPI : S5/G
2: soft-off
239 Thu May 7 18:23:04 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
238 Thu May 7 18:21:13 2009 System Log critical
I2C bus(0) seems to be faulty
237 Thu May 7 18:19:45 2009 System Log critical
I2C bus(0) seems to have recovered from the faulty state
236 Thu May 7 18:17:32 2009 System Log critical
I2C bus(0) seems to be faulty
..
...
223 Thu May 7 17:59:00 2009 System Log critical
I2C bus(0) seems to be faulty
222 Thu May 7 17:57:46 2009 IPMI Log critical
ID = 8a : 05/07/2009 : 17:57:46 : System ACPI Power State : ACPI : S5/G
2: soft-off
221 Thu May 7 17:56:39 2009 IPMI Log critical
ID = 89 : 05/07/2009 : 17:56:39 : System ACPI Power State : ACPI : S5/G
2: soft-off
220 Thu May 7 17:55:06 2009 IPMI Log critical
ID = 88 : 05/07/2009 : 17:55:06 : System ACPI Power State : ACPI : S5/G
2: soft-off
219 Thu May 7 17:53:53 2009 IPMI Log critical
ID = 87 : 05/07/2009 : 17:53:53 : System ACPI Power State : ACPI : S5/G
2: soft-off
218 Thu May 7 17:51:36 2009 IPMI Log critical
ID = 86 : 05/07/2009 : 17:51:36 : System ACPI Power State : ACPI : S0/G
0: working
217 Thu May 7 17:51:31 2009 IPMI Log critical
ID = 85 : 05/07/2009 : 17:51:30 : System ACPI Power State : ACPI : S5/G
2: soft-off
The /SP cannot be reset any more via the CLI:
> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP
Performing reset on /SP failed
reset: No such object
The HTTP interface still seems to give full access to the /SYS. I was able to start it this way, and I also was able to reset the /SP via the HTTP interface. But even after the reset, the CLI comes up without showing the /SYS tree.
I logged in to the OS and did a poweroff from there. However, judging from the HTTP interface, the node did not come down and I did an "immediate poweroff" from there.
I initiated another SP firmware upgrade with the same firmware from the HTTP interface. The interface showed that the upgrade had correctly worked and the SP went into another reboot.
Still. The CLI came up again without showing the /SYS tree.
I started the OS again through the HTTP interface.
/var/log/messages
showed that the system seems to have shut down cleanly the last time.
Repaired by doing an ILOM reinstall with deleting configuration
After doing another Firmware reinstall with selecting "set all configuration to defaults" the ILOM seems to behave normally. All /SYS entries are back.
t3ui01
Did the firmware upgrade on June 1st. On the following morning, these service LEDs were turned on (I do not know whether it had already been on before the upgrade).
- /SYS/MB/MCH/DA0/SERVICE
- /SYS/SERVICE
I shut down the system from the OS using "poweroff". The ILOM did not show the system as "off" even after several minutes.
I disconnected the power for a minute, then reconnected. The service LEDs were all off now.
After restarting the system, I saw that the LED was again turned on:
/SYS/MB/MCH/DA0/SERVICE
This points to a DIMM problem.
Looking at the OS /var/log/messages, I find indeed some DIMM related messages beginning at May 24 2009:
May 24 17:54:17 t3ui01 kernel: EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x2000
May 24 17:54:17 t3ui01 kernel: EDAC MC0: CE row 0, channel 0, label "": (Branch=0 DRAM-Bank=3 RDWR=Read RAS=811
4 CAS=2852, CE Err=0x2000)
The ILOM event list shows the following entries
bd | 06/01/2009 | 18:56:18 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0
c6 | 06/01/2009 | 20:16:42 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0
d2 | 06/02/2009 | 12:21:47 | Memory | Memory Device Disabled | Asserted | CPU 0 DIMM 0
ILOM CLI passwd DB corrupted on t3wn02 and t3wn07
Cannot log in to ILOM management on rmwn02 and rmwn07 any more. The default password is no longer accepted (but it had never been changed). Logging in to the sunservice account also does not work any more.
I am convinced that this is a result from the old ILOM version's problem that I had discovered a while ago (
SunHwContact#2008_01_30_Machines_lose_IPMI_co), where a runaway process on the SP ate up all available memory and caused other ILOM processes (like the IPMI service) to die.
The nodes were powered down from the OS, and then disconnected for a prolonged time from the power. Did not help.
Luckily, the HTTP interface still was accessible with the correct password. From there I did a "set all configuration to defaults (including users)" (in Maintenance/Configuration Management). After a reboot the ILOM had a new SSH key, but I was able to log in again to the CLI with the default password.
--
DerekFeichtinger - 27 May 2009