Symptoms
Summary: Machine unbootable due to patch 137138-09
Occurrences
At what times did this problem occur (used to estimate frequency):
Observations
On two x4500 Thumpers, applying patch 137138-09 on reboot (or in single-user mode) resulted in a corrupt kernel and a corrupted filesystem.
The last lines from t3fs04 for the update process were
Installing updates
Installing update 127128-11 Succeeded
Installing update 137138-09
I had initiated an "init 0" on the system console, as recommended by the update process. In both cases a remote ssh session remained alive, the redirected system console stopped working, and the machine could not be shut down cleanly, so a forced shutdown was needed. I had made sure that no smpatch-related processes were still running (and waited 2 hours in the second case). After that, the machine was unable to boot, and the kernel messages pointed to unresolved symbols. What an utter mess.
Solution or Workaround
Sun themselves have issued some warnings about this patch. They document problems for two system configurations, but we, as well as others on the net, have experienced problems with other setups. The issue has been known since November, but unbelievably, Sun has neither withdrawn this patch nor fixed it.
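Before rebooting a similarly configured machine, it may help to check whether 137138-09 is already among the installed patches. The following is a minimal sketch and was not run during this incident: has_patch is a hypothetical helper, and it assumes that "showrev -p" prints one line per installed patch beginning with "Patch: <id>".

```shell
# has_patch ID
# Reads "showrev -p"-style output on stdin and succeeds (exit 0) only if
# some line has "Patch:" as its first field and ID as its second field.
# Assumption: each installed patch appears as "Patch: <id> Obsoletes: ...".
has_patch() {
    awk -v id="$1" '$1 == "Patch:" && $2 == id { found = 1 } END { exit !found }'
}

# Usage on the Solaris host (not executed here):
#   showrev -p | has_patch 137138-09 && echo "137138-09 is installed"
```

Reading the patch list from stdin keeps the helper independent of the host, so it can be tried against saved output from another machine.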
I tried some tips from the patch 137138-09 discussion:
Boot in failsafe mode (From the similarly configured machine t3fs01 I saw that the boot device is c5t0d0p0)
# mkdir /var/tmp/mnt
# mount -F ufs /dev/dsk/c5t0d0s0 /var/tmp/mnt
# bootadm update-archive -R /var/tmp/mnt
Creating boot_archive for /tmp/root/var/tmp/mnt
updating /tmp/root/var/tmp/mnt/platform/i86pc/boot_archive
# sync
# umount /var/tmp/mnt
# reboot   # not into failsafe, but into the normal kernel
NOTICE: /: unexpected free inode 99933, run fsck(1M) -o f
WARNING: /: unexpected allocated inode 104911, run fsck(1M) -o f
I shut the system down and left. The next day I discovered that it had been hanging at the boot prompt the whole time.
Booting the system now failed:
root (hd0,0,a)
Filesystem type is ufs, partition type 0xbf
kernel /platform/i86pc/multiboot
[Multiboot-elf, <0x1000000:0x141eb:0x128f5>, shtab=0x1027258, entry=0x1000000]
module /platform/i86pc/boot_archive
Error 28: Selected item cannot fit into memory
Booting 'Solaris 10 11/06 s10x_u3wos_10 X86'
I booted in failsafe mode and ran an fsck, which took quite some time and produced lots of messages:
# fsck -y -F ufs /dev/dsk/c5t0d0s0
# reboot
Trying to boot into the normal system from the boot prompt froze the console screen.
The boot image seemed to have been destroyed again, so I rebuilt it:
# mount -F ufs /dev/dsk/c5t0d0s0 /mnt
# bootadm update-archive -R /mnt
Creating boot_archive for /mnt
updating /mnt/platform/i86pc/boot_archive
# sync
# reboot   # the system came up again with some errors
SunOS Release 5.10 Version Generic_137138-09 64-bit
Copyright 1983-2008 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
e1000g0: DL_BIND_REQ failed: DL_SYSERR (errno 16)
e1000g0: DL_UNBIND_REQ failed: DL_OUTSTATE
Failed to plumb IPv4 interface(s): e1000g0
I again booted in failsafe mode and ran the same fsck multiple times until no more errors showed up. The following boot archive update then caused a kernel panic:
# fsck -y -F ufs /dev/dsk/c5t0d0s0
..
# mount -F ufs /dev/dsk/c5t0d0s0 /mnt
# bootadm update-archive -R /mnt
panic[cpu3]/thread=bcda5800: alloccgblk: can't find blk in cyl, pos:0, i:380, fs:/mnt bno: 300
ae5b7b54 genunix:vcmn_err+13 (3, feca0900, ae5b7b)
ae5b7b74 ufs:real_panic_v+47 (0, feca0900, ae5b7b)
ae5b7b9c ufs:ufs_fault_v+19f (bc652f00, feca0900,)
ae5b7bb0 ufs:ufs_fault+12 (bc652f00, feca0900,)
ae5b7c08 ufs:alloccgblk+28f (b3fd4500, bc6f7000,)
ae5b7c50 ufs:alloccg+3f3 (bc75ab18, 11, ce320)
ae5b7c7c ufs:hashalloc+2b (bc75ab18, 11, ce320)
ae5b7cbc ufs:alloc+120 (bc75ab18, ce320, 20)
ae5b7dac ufs:bmap_write+a9a (bc75ab18, 8000, 0, )
ae5b7e68 ufs:wrip+397 (bc75ab18, ae5b7f3c,)
ae5b7ecc ufs:ufs_write+492 (ae2290c0, ae5b7f3c,)
ae5b7f04 genunix:fop_write+2a (ae2290c0, ae5b7f3c,)
ae5b7f84 genunix:write+29a (4, 80840f8, 148cc, )
syncing file systems... [1] 1 [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] [1] done (not all i/o completed)
skipping system dump - no dump device configured
rebooting...
I again booted into failsafe and repeated the fsck, mount, and boot archive sequence. This time writing the boot archive worked. The system came up halfway, with the same network-related errors as before, but I never got a login prompt.
I was not able to save the installation.
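For reference, the fsck, mount, and bootadm sequence that was repeated by hand in this log could be wrapped roughly as follows. This is only a sketch under stated assumptions: RUN, ROOTDEV, MNT, MAX_PASSES and both function names are illustrative and not part of any Solaris tool, and it assumes that "fsck -m" exits with status 0 once the file system no longer needs checking. It did not rescue the installation here, so treat it as a record of the procedure rather than a fix.

```shell
#!/bin/sh
# Sketch only: the failsafe-mode repair sequence from this log, wrapped
# so that it can be dry-run. All variable and function names below are
# illustrative assumptions.
RUN=${RUN:-}                           # set RUN=echo to print commands instead of running them
ROOTDEV=${ROOTDEV:-/dev/dsk/c5t0d0s0}  # root slice used in this log
MNT=${MNT:-/mnt}                       # temporary mount point
MAX_PASSES=${MAX_PASSES:-5}            # give up after this many fsck passes

# Repeat "fsck -y" until "fsck -m" reports the file system as clean
# (assumed: exit status 0 means no further check is needed).
fsck_until_clean() {
    pass=1
    while [ "$pass" -le "$MAX_PASSES" ]; do
        $RUN fsck -y -F ufs "$ROOTDEV"
        if $RUN fsck -m -F ufs "$ROOTDEV"; then
            return 0
        fi
        pass=$((pass + 1))
    done
    return 1
}

# Mount the repaired root slice, rebuild its boot archive, and unmount.
# Returns the exit status of bootadm.
rebuild_boot_archive() {
    $RUN mount -F ufs "$ROOTDEV" "$MNT" || return 1
    $RUN bootadm update-archive -R "$MNT"
    status=$?
    $RUN sync
    $RUN umount "$MNT"
    return "$status"
}

# Usage from the failsafe shell (not executed here):
#   fsck_until_clean && rebuild_boot_archive && reboot
```

Setting RUN=echo before calling the functions prints the commands without touching the disk, which is a cheap way to review the sequence before running it on a damaged system.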
--
DerekFeichtinger - 20 Dec 2008