<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log3][previous page]] / [[CMSTier3Log5][next page]] of Tier3 site log %M%

---+ 25. 11. 2008 Thumper Fileserver t3fs03 problems

The dCache pools on the node were marked as unavailable. A listing of the =/data/poolname/data= directories on the node hung forever. The =/var/adm/messages= log reported issues related to the Marvell SATA controller.

<pre>
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03  Error for Command: read(10)    Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Requested Block: 30705670    Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03  port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
</pre>

=poweroff= failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:

<pre>
WARNING: Power off requested from power button or SC, powering down the system!
Shutdown started.    Tue Nov 25 19:49:02 CET 2008
WARNING: Failed to shut down the system!
</pre>

So I ordered the ILOM to do an immediate poweroff. These are some of the ILOM log entries (newest first):

<pre>
1755  Tue Nov 25 20:56:46 2008  Audit  Log  minor  root : Set : object = /SYS/power_state : value = off : success
1754  Tue Nov 25 20:56:46 2008  Audit  Log  minor  KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753  Tue Nov 25 20:49:37 2008  Audit  Log  minor  root : Set : object = /SYS/power_state : value = soft : success
1752  Tue Nov 25 20:49:37 2008  Audit  Log  minor  KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
</pre>

After the machine came back up, I started dCache manually:

<pre>
root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch  (t3dcachedb01.psi.ch:/pnfsdoors): 640000 blocks -1 files
Starting dcap-t3fs03Domain  Done (pid=1555)
Starting gridftp-t3fs03Domain  Done (pid=1605)
Starting gsidcap-t3fs03Domain 6 Done (pid=1655)
Starting t3fs03Domain  Done (pid=1709)
</pre>

After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the running jobs were not able to recover: their connections stayed in a hanging state, marked "no movers found" on the active transfers page.

---+ 15. 12. 2008 Again t3fs03 problems - OS patching

Again, the pools on t3fs03 went offline and the system was in a strange state. The system log again pointed to the Marvell controller issue, so I decided to patch the system immediately. First, I again had to force a shutdown and reboot.

I put this list of patches into a file =patch-list-20081215.lst=:

<pre>
125556-01 SunOS 5.10_x86: patch behavior patch
138270-02 SunOS 5.10_x86: devfs patch
127128-11 SunOS 5.10_x86: kernel patch
137140-06 SunOS 5.10_x86: aggr patch
119255-59 SunOS 5.10_x86: Install and Patch Utilities Patch
137122-03 SunOS 5.10_x86: e1000g driver patch
138110-01 SunOS 5.10_x86: ata driver patch
</pre>

Then I ran the update command:

<pre>
smpatch update -x idlist=patch-list-20081215.lst
</pre>

The patching required a system reboot (or at least an =init 0=). Strangely enough, many more patches were installed than the original dependency analysis had indicated, and several of them failed.

<pre>
root@t3fs03 # init 0
root@t3fs03 # svc.startd: The system is coming down.  Please wait.
svc.startd: 94 system services are now being stopped.
Dec 15 11:35:06 t3fs03 syslogd: going down on signal 15
Dec 15 11:35:07 rpc.metad: Terminated
Installing updates
Installing update 138270-02 Succeeded
Installing update 127128-11 Succeeded
Installing update 137140-06 Succeeded
Installing update 137122-03 Succeeded
Installing update 138110-01 Succeeded
Installing update 128338-02 Succeeded
Installing update 126207-04 Succeeded
Installing update 127889-10 Succeeded
Installing update 137293-02 Failed
Installing update 121429-10 Failed
Installing update 137112-05 Succeeded
Installing update 138071-03 Succeeded
Installing update 138053-02 Succeeded
Installing update 138307-01 Succeeded
Installing update 120273-23 Succeeded
Installing update 138065-03 Succeeded
Installing update 138061-03 Succeeded
Installing update 137022-02 Succeeded
Installing update 138045-02 Succeeded
Installing update 138043-02 Failed
Installing update 121005-04 Succeeded
Installing update 137290-01 Succeeded
Installing update 128401-05 Succeeded
Installing update 138309-02 Succeeded
Installing update 128307-05 Failed
Installing update 128301-04 Succeeded
Installing update 138091-01 Failed
Installing update 138076-02 Succeeded
Installing update 126654-02 Succeeded
Installing update 138084-01 Succeeded
Installing update 137020-02 Succeeded
Installing update 123896-04 Succeeded
Creating ram disk for /var/run/.patch_root_loopbackmnt
updating /var/run/.patch_root_loopbackmnt/platform/i86pc/boot_archive...this may take a minute
svc.startd: The system is down.
syncing file systems... done
Press any key to reboot.
</pre>

The reboot worked fine.

-- Main.DerekFeichtinger - 25 Nov 2008

----------------

%ICON{arrowleft}% Go to [[CMSTier3Log3][previous page]] / [[CMSTier3Log5][next page]] of Tier3 site log %M%
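Note on the 15. 12. 2008 patch run: the console output mixes succeeded and failed updates, and the failed ones are easy to overlook. The following is a minimal sketch of how they could be picked out, assuming the console output was captured as text. The function name =failed_patches= is hypothetical and not part of any Solaris tool, and the sample lines are copied from the output above. On a stock Solaris 10 system =nawk= may be needed in place of =awk=; that is an assumption, not something tested on t3fs03.

```shell
# Print the IDs of updates reported as "Failed" in the patch console output.
# Reads the files given as arguments, or stdin when none are given.
# Expects lines of the form: "Installing update <patch-id> <Succeeded|Failed>"
failed_patches() {
    awk '$1 == "Installing" && $2 == "update" && $4 == "Failed" { print $3 }' "$@"
}

# Example with a few lines taken from the 15. 12. 2008 console output
failed_patches <<'EOF'
Installing updates
Installing update 127889-10 Succeeded
Installing update 137293-02 Failed
Installing update 121429-10 Failed
Installing update 137112-05 Succeeded
EOF
```

For the sample lines this prints =137293-02= and =121429-10=, one per line. Applied to the full output of that run, it would list the five updates marked Failed (137293-02, 121429-10, 138043-02, 128307-05, 138091-01), which can then be checked individually.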
Topic revision: r5 - 2008-12-15 - DerekFeichtinger