
25. 11. 2008 Thumper Fileserver t3fs03 problems

The dCache pools on the node were marked as unavailable. A listing of the /data/poolname/data directories on the node hung indefinitely. The /var/adm/messages log reported errors from the Marvell SATA controller driver (marvell88sx).

Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03  Error for Command: read(10)                Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Requested Block: 30705670                  Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Vendor: ATA                                Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03  port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
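
To see at a glance which controllers and ports are affected, the reset notices in an excerpt like the one above can be tallied. The following is only a minimal sketch: it assumes each message occupies a single line in the log file, and the function name count_port_resets and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches driver notices of the form seen in the excerpt, e.g.
# "... NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout"
RESET_RE = re.compile(r"NOTICE: (marvell88sx\d+): device on port (\d+) reset: (.+)$")


def count_port_resets(lines):
    """Count reset notices per (controller, port, reason) tuple."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line.rstrip("\n"))
        if match:
            counts[match.groups()] += 1
    return counts


# Sample lines taken from the excerpt (unwrapped); the last one is not a
# driver reset notice and is therefore ignored.
sample = [
    "Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout",
    "Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error",
    "Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout",
    "Nov 24 20:17:02 t3fs03  port 4: device reset",
]

for (controller, port, reason), n in sorted(count_port_resets(sample).items()):
    print(f"{controller} port {port}: {reason} (x{n})")
```

To scan the real log, the same function could be fed an open file handle for /var/adm/messages instead of the sample list.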

A poweroff command failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:

WARNING: Power off requested from power button or SC, powering down the system!

Shutdown started.    Tue Nov 25 19:49:02 CET 2008

WARNING: Failed to shut down the system!

So, I ordered the ILOM to do an immediate poweroff.

These are some of the ILOM entries:

1755   Tue Nov 25 20:56:46 2008  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = off : success
1754   Tue Nov 25 20:56:46 2008  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753   Tue Nov 25 20:49:37 2008  Audit     Log       minor
       root : Set : object = /SYS/power_state : value = soft : success
1752   Tue Nov 25 20:49:37 2008  Audit     Log       minor
       KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
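
The gap between the soft power-off request (entry 1753) and the immediate power-off (entry 1755) can be read off these timestamps. A small sketch of the arithmetic follows; the helper name seconds_between is hypothetical, and the format string assumes the English day and month abbreviations shown in the ILOM output.

```python
from datetime import datetime

# Timestamp layout as printed in the ILOM entries, e.g. "Tue Nov 25 20:49:37 2008".
ILOM_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def seconds_between(earlier, later):
    """Return the whole number of seconds from one ILOM timestamp to another."""
    start = datetime.strptime(earlier, ILOM_TIME_FORMAT)
    end = datetime.strptime(later, ILOM_TIME_FORMAT)
    return int((end - start).total_seconds())


soft_request = "Tue Nov 25 20:49:37 2008"  # entry 1753, value = soft
forced_off = "Tue Nov 25 20:56:46 2008"    # entry 1755, value = off

waited = seconds_between(soft_request, forced_off)
print(f"{waited // 60} min {waited % 60} s between soft and immediate power-off")
```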

After the machine has restarted, start dCache manually:

root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch       (t3dcachedb01.psi.ch:/pnfsdoors):  640000 blocks       -1 files
Starting dcap-t3fs03Domain  Done (pid=1555)
Starting gridftp-t3fs03Domain  Done (pid=1605)
Starting gsidcap-t3fs03Domain  6 Done (pid=1655)
Starting t3fs03Domain  Done (pid=1709)

After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the running jobs were not able to recover: their connections stayed hung, marked as "no movers found" on the active transfers page.
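
The first symptom of this incident was a directory listing that never returned. One way to check for that condition without blocking the shell is to run the listing under a timeout. This is a sketch only: command_hangs is a hypothetical helper, and the 10 second default is an arbitrary assumption rather than a tested threshold.

```python
import subprocess


def command_hangs(argv, timeout_s=10.0):
    """Return True if the command has not exited within timeout_s seconds.

    The exit status is not checked, so a command that fails quickly
    (for example on a missing path) counts as "not hung".
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return False
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a failing disk may not react
        # to the kill, so we deliberately do not wait for it afterwards.
        proc.kill()
        return True
```

For example, command_hangs(["ls", "/data/poolname/data"]) would probe one pool directory, with poolname replaced by a real pool name. Popen with an explicit wait is used instead of subprocess.run because run waits for the child after killing it, which could itself block if the process is stuck in disk I/O.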

-- DerekFeichtinger - 25 Nov 2008



Topic revision: r3 - 2008-12-08 - ZhilingChen
 