Go to
previous page /
next page of Tier3 site log
25. 11. 2008 Thumper Fileserver t3fs03 problems
The dCache pools on the node were marked as unavailable. A listing of the /data/poolname/data directories on the node hung forever.
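A hung listing like this can be detected without blocking the shell that runs the check. The following is a minimal sketch, not part of the original procedure: the helper name `command_responds` and the timeout value are assumptions chosen for illustration. It starts the command as a child process and gives up after a timeout instead of waiting forever.

```python
import subprocess


def command_responds(cmd, timeout_s=10.0):
    """Return True if cmd exits within timeout_s seconds, False if it hangs.

    The exit status is ignored on purpose: a failing command still shows
    that the filesystem answered.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Request a kill but do not wait for the child: a process blocked
        # in uninterruptible disk I/O may never exit.
        proc.kill()
        return False
```

For example, `command_responds(["ls", "/data/poolname/data"])` would return False for a pool directory in the state described above, with `poolname` standing in for the real pool name.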
The /var/adm/messages log reported problems related to the Marvell SATA controller:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03 port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03 port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03 port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03 port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info] device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info] device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03 Error for Command: read(10) Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice] Requested Block: 30705670 Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice] Vendor: ATA Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice] Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice] ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03 port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
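To see at a glance which controllers and ports are affected, the reset notices can be tallied from the log. This is a minimal sketch, not a tool used in the original investigation: the function name `count_resets` is hypothetical, the pattern is derived only from the message format in the excerpt, and it assumes each message sits on a single line in the file.

```python
import re
from collections import Counter

# Matches reset notices of the form seen in the excerpt, e.g.
# "... NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout"
RESET_RE = re.compile(
    r"NOTICE: (?P<controller>marvell88sx\d+): "
    r"device on port (?P<port>\d+) reset: (?P<reason>.+)$"
)


def count_resets(lines):
    """Count reset notices per (controller, port, reason)."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line.rstrip())
        if match:
            key = (match["controller"], int(match["port"]), match["reason"])
            counts[key] += 1
    return counts
```

Passing an open file handle for /var/adm/messages to `count_resets` returns a `Counter` keyed by controller, port, and reset reason. Lines without a reset notice, such as the sata and scsi messages, are ignored.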
Running poweroff failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:
WARNING: Power off requested from power button or SC, powering down the system!
Shutdown started. Tue Nov 25 19:49:02 CET 2008
WARNING: Failed to shut down the system!
So, I ordered the ILOM to do an immediate poweroff.
These are some of the ILOM event log entries (newest first):
1755 Tue Nov 25 20:56:46 2008 Audit Log minor
root : Set : object = /SYS/power_state : value = off : success
1754 Tue Nov 25 20:56:46 2008 Audit Log minor
KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753 Tue Nov 25 20:49:37 2008 Audit Log minor
root : Set : object = /SYS/power_state : value = soft : success
1752 Tue Nov 25 20:49:37 2008 Audit Log minor
KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
After the machine has restarted, dCache must be started manually:
root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch (t3dcachedb01.psi.ch:/pnfsdoors): 640000 blocks -1 files
Starting dcap-t3fs03Domain Done (pid=1555)
Starting gridftp-t3fs03Domain Done (pid=1605)
Starting gsidcap-t3fs03Domain 6 Done (pid=1655)
Starting t3fs03Domain Done (pid=1709)
After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the running jobs could not recover: their connections stayed hanging, marked as "no movers found" on the active transfers page.
--
DerekFeichtinger - 25 Nov 2008