<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log3][previous page]] / [[CMSTier3Log5][next page]] of Tier3 site log %M%

---+ 25. 11. 2008 Thumper Fileserver t3fs03 problems

The dCache pools on the node were marked as unavailable. Listing the =/data/poolname/data= directories on the node hung indefinitely. The =/var/adm/messages= log reported problems related to the Marvell controller:

<pre>
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03  Error for Command: read(10)    Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Requested Block: 30705670    Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03  port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
</pre>

=poweroff= failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:

<pre>
WARNING: Power off requested from power button or SC, powering down the system!
Shutdown started.    Tue Nov 25 19:49:02 CET 2008

WARNING: Failed to shut down the system!
</pre>

So I ordered the ILOM to do an immediate poweroff. These are some of the ILOM entries:

<pre>
1755   Tue Nov 25 20:56:46 2008  Audit     Log       minor     root : Set : object = /SYS/power_state : value = off : success
1754   Tue Nov 25 20:56:46 2008  Audit     Log       minor     KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753   Tue Nov 25 20:49:37 2008  Audit     Log       minor     root : Set : object = /SYS/power_state : value = soft : success
1752   Tue Nov 25 20:49:37 2008  Audit     Log       minor     KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
</pre>

After the machine restarted, I started dCache manually:

<pre>
root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch (t3dcachedb01.psi.ch:/pnfsdoors): 640000 blocks -1 files
Starting dcap-t3fs03Domain  Done (pid=1555)
Starting gridftp-t3fs03Domain  Done (pid=1605)
Starting gsidcap-t3fs03Domain 6 Done (pid=1655)
Starting t3fs03Domain  Done (pid=1709)
</pre>

After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the running jobs could not recover: their connections stayed hung, marked "no movers found" on the active transfers page.

-- Main.DerekFeichtinger - 25 Nov 2008
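For future incidents of this kind, a quick tally of which controller ports are being reset can help tell a single failing disk from a controller-wide problem. The following is only a minimal sketch and not a tool that was used on the day: the function name =count_port_resets=, the regular expression, and the embedded sample are illustrative assumptions, and the pattern is derived solely from the =marvell88sx= lines in the =/var/adm/messages= excerpt on this page.

```python
import re
import sys
from collections import Counter

# Pattern derived from the log excerpt on this page, e.g.:
#   ... NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
# It is an assumption that other driver versions use the same wording.
RESET_RE = re.compile(
    r"NOTICE: (?P<controller>marvell88sx\d+): "
    r"device on port (?P<port>\d+) reset: (?P<reason>.+)$"
)

# Four reset notices and one unrelated line, taken from the excerpt above.
SAMPLE = """\
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
"""


def count_port_resets(lines):
    """Count reset notices per (controller, port) pair."""
    counts = Counter()
    for line in lines:
        match = RESET_RE.search(line.rstrip())
        if match:
            counts[(match["controller"], int(match["port"]))] += 1
    return counts


if __name__ == "__main__":
    # With a file argument (e.g. /var/adm/messages) scan that file,
    # otherwise fall back to the embedded sample.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            result = count_port_resets(handle)
    else:
        result = count_port_resets(SAMPLE.splitlines())
    for (controller, port), n in sorted(result.items()):
        print(f"{controller} port {port}: {n} reset notice(s)")
```

Resets spread across several controllers, as in the excerpt here (=marvell88sx2= port 4 and =marvell88sx0= port 6), would argue against a single bad disk, but the counts alone do not identify the root cause.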
Topic revision: r3 - 2008-12-08 - ZhilingChen