Arrow left Go to previous page / next page of Tier3 site log MOVED TO...

25. 11. 2008 Thumper Fileserver t3fs03 problems

The dCache pools on the node were marked as unavailable, and a listing of the /data/poolname/data directories on the node hung indefinitely. The /var/adm/messages log reported errors from the Marvell SATA controller:

Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03  Error for Command: read(10)                Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Requested Block: 30705670                  Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03  port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
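
To see at a glance which controller and port are affected, the reset notices in such a log can be tallied. The following is only a minimal sketch in Python, not a tool used on the Tier3: the function name count_resets and the regular expression are illustrative assumptions, and it expects each syslog entry to be on a single, unwrapped line.

```python
import re
from collections import Counter

# Matches marvell88sx "device on port N reset" notices as they appear in
# /var/adm/messages. The pattern is an assumption based on the excerpt only.
RESET_RE = re.compile(
    r"NOTICE: (?P<ctrl>marvell88sx\d+): device on port (?P<port>\d+) reset: (?P<reason>.+)$"
)


def count_resets(lines):
    """Count reset notices per (controller instance, port)."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line.rstrip())
        if match:
            counts[(match.group("ctrl"), int(match.group("port")))] += 1
    return counts


# A few entries taken from the log excerpt, one entry per line.
sample = [
    "Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout",
    "Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:",
    "Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error",
    "Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout",
]

counts = count_resets(sample)
for (ctrl, port), n in sorted(counts.items()):
    print(f"{ctrl} port {port}: {n} reset notice(s)")
```

On a live system the same function could be fed the lines of /var/adm/messages directly; resets that keep recurring on one port would point at a single disk, while resets spread over several controllers (as here, marvell88sx2 and marvell88sx0) suggest a driver or controller problem.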

poweroff failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:

WARNING: Power off requested from power button or SC, powering down the system!

Shutdown started.    Tue Nov 25 19:49:02 CET 2008

WARNING: Failed to shut down the system!

So, I ordered the ILOM to do an immediate poweroff.

These are some of the ILOM entries:

1755   Tue Nov 25 20:56:46 2008  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = off : success
1754   Tue Nov 25 20:56:46 2008  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753   Tue Nov 25 20:49:37 2008  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = soft : success
1752   Tue Nov 25 20:49:37 2008  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success

After the machine restarted, I started dCache manually:

root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch       (t3dcachedb01.psi.ch:/pnfsdoors):  640000 blocks       -1 files
Starting dcap-t3fs03Domain  Done (pid=1555)
Starting gridftp-t3fs03Domain  Done (pid=1605)
Starting gsidcap-t3fs03Domain  6 Done (pid=1655)
Starting t3fs03Domain  Done (pid=1709)

After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the running jobs were not able to recover: their connections stayed hanging and were marked "no movers found" on the active transfers page.

15. 12. 2008 Again t3fs03 problems - OS patching

Again, the pools on t3fs03 went offline and the system was in a strange state. The system log again pointed to the Marvell controller issue, so I decided to patch the system immediately. First, I once more had to force a shutdown and reboot.

I put this list of patches into a file patch-list-20081215.lst:

125556-01 SunOS 5.10_x86: patch behavior patch
138270-02 SunOS 5.10_x86: devfs patch
127128-11 SunOS 5.10_x86: kernel patch
137140-06 SunOS 5.10_x86: aggr patch
119255-59 SunOS 5.10_x86: Install and Patch Utilities Patch
137122-03 SunOS 5.10_x86: e1000g driver patch
138110-01 SunOS 5.10_x86: ata driver patch

Then I issued the update command:

smpatch update -x idlist=patch-list-20081215.lst

The patching required a system reboot (or at least an init 0). Strangely enough, many more patches than indicated in the original dependency analysis were installed.

root@t3fs03 # init 0
root@t3fs03 # svc.startd: The system is coming down.  Please wait.
svc.startd: 94 system services are now being stopped.
Dec 15 11:35:06 t3fs03 syslogd: going down on signal 15
Dec 15 11:35:07 rpc.metad: Terminated
Installing updates
Installing update 138270-02 Succeeded
Installing update 127128-11 Succeeded
Installing update 137140-06 Succeeded
Installing update 137122-03 Succeeded
Installing update 138110-01 Succeeded
Installing update 128338-02 Succeeded
Installing update 126207-04 Succeeded
Installing update 127889-10 Succeeded
Installing update 137293-02 Failed
Installing update 121429-10 Failed
Installing update 137112-05 Succeeded
Installing update 138071-03 Succeeded
Installing update 138053-02 Succeeded
Installing update 138307-01 Succeeded
Installing update 120273-23 Succeeded
Installing update 138065-03 Succeeded
Installing update 138061-03 Succeeded
Installing update 137022-02 Succeeded
Installing update 138045-02 Succeeded
Installing update 138043-02 Failed
Installing update 121005-04 Succeeded
Installing update 137290-01 Succeeded
Installing update 128401-05 Succeeded
Installing update 138309-02 Succeeded
Installing update 128307-05 Failed
Installing update 128301-04 Succeeded
Installing update 138091-01 Failed
Installing update 138076-02 Succeeded
Installing update 126654-02 Succeeded
Installing update 138084-01 Succeeded
Installing update 137020-02 Succeeded
Installing update 123896-04 Succeeded
Creating ram disk for /var/run/.patch_root_loopbackmnt
updating /var/run/.patch_root_loopbackmnt/platform/i86pc/boot_archive...this may take a minute
svc.startd: The system is down.
syncing file systems... done
Press any key to reboot.

The reboot worked fine.
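
To check which patches were installed beyond the requested list, and which ones failed, the patch list can be compared against the console output. This is a minimal sketch in Python under stated assumptions: the function names (requested_ids, install_results, summarize) are hypothetical, the patch ID format NNNNNN-NN is inferred from the entries above, and the sample data is only a short excerpt of the list and output.

```python
import re

# "Installing update <id> Succeeded|Failed", as printed during shutdown.
RESULT_RE = re.compile(r"Installing update (\d{6}-\d{2}) (Succeeded|Failed)")
# A patch ID at the start of a line in the patch list file.
LIST_ID_RE = re.compile(r"\s*(\d{6}-\d{2})\b")


def requested_ids(list_text):
    """Return the patch IDs that start each line of the patch list."""
    ids = []
    for line in list_text.splitlines():
        match = LIST_ID_RE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def install_results(console_text):
    """Map patch ID -> 'Succeeded' or 'Failed' from the console output."""
    return dict(RESULT_RE.findall(console_text))


def summarize(list_text, console_text):
    """Compare requested patches with what the console output reports."""
    requested = requested_ids(list_text)
    results = install_results(console_text)
    return {
        "missing": [p for p in requested if p not in results],
        "failed": [p for p, r in results.items() if r == "Failed"],
        "unrequested": [p for p in results if p not in requested],
    }


# Excerpts from patch-list-20081215.lst and from the console output.
patch_list = """125556-01 SunOS 5.10_x86: patch behavior patch
138270-02 SunOS 5.10_x86: devfs patch
127128-11 SunOS 5.10_x86: kernel patch
"""
console = """Installing updatesInstalling update 138270-02 Succeeded
Installing update 127128-11 Succeeded
Installing update 128338-02 Succeeded
Installing update 137293-02 Failed
"""

summary = summarize(patch_list, console)
print(summary)
```

Run against the full list and output above, such a comparison shows that five updates report Failed (137293-02, 121429-10, 138043-02, 128307-05, 138091-01) and that two requested patches (125556-01 and 119255-59) do not appear in the shutdown-time output at all; the log does not say whether those two were applied earlier.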

-- DerekFeichtinger - 25 Nov 2008


Topic revision: r5 - 2008-12-15 - DerekFeichtinger
 