<!-- keep this as a security measure:
#uncomment if the subject should only be modifiable by the listed groups
   * Set ALLOWTOPICCHANGE = Main.TWikiAdminGroup,Main.CMSAdminGroup
   * Set ALLOWTOPICRENAME = Main.TWikiAdminGroup,Main.CMSAdminGroup
#uncomment this if you want the page only be viewable by the listed groups
#   * Set ALLOWTOPICVIEW = Main.TWikiAdminGroup,Main.CMSAdminGroup
-->

%TOC%

%ICON{arrowleft}% Go to [[CMSTier3Log3][previous page]] / [[CMSTier3Log5][next page]] of Tier3 site log %M%

---+ 25. 11. 2008 Thumper Fileserver t3fs03 problems

The dCache pools on the node were marked as unavailable. A listing of the =/data/poolname/data= directories on the node hung indefinitely. The =/var/adm/messages= log reported errors from the Marvell SATA controller driver (=marvell88sx=):

<pre>
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: DMA command timeout
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx2: device on port 4 reset: device disconnected or device error
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: device reset
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link lost
Nov 24 20:17:02 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1:
Nov 24 20:17:02 t3fs03  port 4: link established
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 812950 kern.warning] WARNING: marvell88sx2: error on port 4:
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device disconnected
Nov 24 20:17:02 t3fs03 marvell88sx: [ID 517869 kern.info]       device connected
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@4,0 (sd26):
Nov 24 20:17:02 t3fs03  Error for Command: read(10)    Error Level: Retryable
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Requested Block: 30705670    Error Block: 30705670
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Vendor: ATA    Serial Number:
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    Sense Key: No Additional Sense
Nov 24 20:17:02 t3fs03 scsi: [ID 107833 kern.notice]    ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: DMA command timeout
Nov 24 21:22:32 t3fs03 sata: [ID 801593 kern.notice] NOTICE: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1:
Nov 24 21:22:32 t3fs03  port 6: device reset
Nov 24 21:22:32 t3fs03 marvell88sx: [ID 670675 kern.info] NOTICE: marvell88sx0: device on port 6 reset: device disconnected or device error
</pre>

=poweroff= failed to shut down the system. I then tried a regular shutdown from the ILOM, which also failed:

<pre>
WARNING: Power off requested from power button or SC, powering down the system!
Shutdown started.    Tue Nov 25 19:49:02 CET 2008

WARNING: Failed to shut down the system!
</pre>

So I ordered the ILOM to do an immediate poweroff. These are some of the ILOM entries:

<pre>
1755 Tue Nov 25 20:56:46 2008 Audit Log minor root : Set : object = /SYS/power_state : value = off : success
1754 Tue Nov 25 20:56:46 2008 Audit Log minor KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
1753 Tue Nov 25 20:49:37 2008 Audit Log minor root : Set : object = /SYS/power_state : value = soft : success
1752 Tue Nov 25 20:49:37 2008 Audit Log minor KCS Command : Set ACPI Power State : system power state = no change : device power state = no change : success
</pre>

After the machine had restarted, I started dCache manually:

<pre>
root@t3fs03 # /opt/d-cache/bin/dcache start
/pnfs/psi.ch/ not mounted - going to mount it now.
/pnfs/psi.ch (t3dcachedb01.psi.ch:/pnfsdoors): 640000 blocks -1 files
Starting dcap-t3fs03Domain  Done (pid=1555)
Starting gridftp-t3fs03Domain  Done (pid=1605)
Starting gsidcap-t3fs03Domain 6 Done (pid=1655)
Starting t3fs03Domain  Done (pid=1709)
</pre>

After this emergency shutdown I was able to reboot the system normally. The dCache pools also came up fine, but the jobs that had been running could not recover: their connections stayed hung, marked as "no movers found" on the active transfers page.

---+ 15. 12. 2008 Again t3fs03 problems - OS patching

Again, the pools on t3fs03 went offline and the system was in a strange state. The system log again pointed to the Marvell controller issue, so I decided to patch the system immediately. As before, I first had to force a shutdown and reboot.

I put this list of patches into a file =patch-list-20081215.lst=:

<pre>
125556-01 SunOS 5.10_x86: patch behavior patch
138270-02 SunOS 5.10_x86: devfs patch
127128-11 SunOS 5.10_x86: kernel patch
137140-06 SunOS 5.10_x86: aggr patch
119255-59 SunOS 5.10_x86: Install and Patch Utilities Patch
137122-03 SunOS 5.10_x86: e1000g driver patch
138110-01 SunOS 5.10_x86: ata driver patch
</pre>

Then I ran the update command:

<pre>
smpatch update -x idlist=patch-list-20081215.lst
</pre>

The patching required a system reboot (or at least an =init 0=). Strangely enough, many more patches were installed than the original dependency analysis had indicated, and several of them failed:

<pre>
root@t3fs03 # init 0
root@t3fs03 # svc.startd: The system is coming down.  Please wait.
svc.startd: 94 system services are now being stopped.
Dec 15 11:35:06 t3fs03 syslogd: going down on signal 15
Dec 15 11:35:07 rpc.metad: Terminated
Installing updates
Installing update 138270-02  Succeeded
Installing update 127128-11  Succeeded
Installing update 137140-06  Succeeded
Installing update 137122-03  Succeeded
Installing update 138110-01  Succeeded
Installing update 128338-02  Succeeded
Installing update 126207-04  Succeeded
Installing update 127889-10  Succeeded
Installing update 137293-02  Failed
Installing update 121429-10  Failed
Installing update 137112-05  Succeeded
Installing update 138071-03  Succeeded
Installing update 138053-02  Succeeded
Installing update 138307-01  Succeeded
Installing update 120273-23  Succeeded
Installing update 138065-03  Succeeded
Installing update 138061-03  Succeeded
Installing update 137022-02  Succeeded
Installing update 138045-02  Succeeded
Installing update 138043-02  Failed
Installing update 121005-04  Succeeded
Installing update 137290-01  Succeeded
Installing update 128401-05  Succeeded
Installing update 138309-02  Succeeded
Installing update 128307-05  Failed
Installing update 128301-04  Succeeded
Installing update 138091-01  Failed
Installing update 138076-02  Succeeded
Installing update 126654-02  Succeeded
Installing update 138084-01  Succeeded
Installing update 137020-02  Succeeded
Installing update 123896-04  Succeeded
Creating ram disk for /var/run/.patch_root_loopbackmnt
updating /var/run/.patch_root_loopbackmnt/platform/i86pc/boot_archive...this may take a minute
svc.startd: The system is down.
syncing file systems... done
Press any key to reboot.
</pre>

The reboot worked fine.

-- Main.DerekFeichtinger - 25 Nov 2008

----------------

%ICON{arrowleft}% Go to [[CMSTier3Log3][previous page]] / [[CMSTier3Log5][next page]] of Tier3 site log %M%
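
As a rough aid for reviewing a patch run like the one from 15. 12. 2008 above, the following Python sketch compares a patch list with the installer console output and reports failed patches, patches that were installed without being on the list, and listed patches that never show up in the output. It was not part of the original procedure. The function names and the short embedded sample strings are illustrative only, and the regular expressions assume the =Installing update &lt;id&gt; Succeeded|Failed= wording seen in the console output above.

```python
import re

# One result entry, e.g. "Installing update 138270-02 Succeeded".
# Assumption: the installer always uses exactly this wording.
RESULT_RE = re.compile(r"Installing update (\d{6}-\d{2})\s+(Succeeded|Failed)")

# A patch ID at the start of a patch-list line, e.g. "138270-02 SunOS ...".
PATCH_ID_RE = re.compile(r"^\s*(\d{6}-\d{2})\b")


def parse_patch_list(text):
    """Return the patch IDs found at the start of each line of a list file."""
    ids = []
    for line in text.splitlines():
        match = PATCH_ID_RE.match(line)
        if match:
            ids.append(match.group(1))
    return ids


def parse_install_log(text):
    """Return {patch_id: status} for every result entry in the console output."""
    return dict(RESULT_RE.findall(text))


def summarize(requested, results):
    """Group patch IDs into failed, unrequested and not-reported lists."""
    requested_set = set(requested)
    return {
        "failed": sorted(p for p, s in results.items() if s == "Failed"),
        "unrequested": sorted(p for p in results if p not in requested_set),
        "not_reported": sorted(p for p in requested if p not in results),
    }


# Short illustrative samples (a subset of the entries shown on this page).
SAMPLE_LIST = (
    "125556-01 SunOS 5.10_x86: patch behavior patch\n"
    "138270-02 SunOS 5.10_x86: devfs patch\n"
)
SAMPLE_LOG = (
    "Installing updatesInstalling update 138270-02 Succeeded "
    "Installing update 128338-02 Succeeded "
    "Installing update 137293-02 Failed"
)

summary = summarize(parse_patch_list(SAMPLE_LIST), parse_install_log(SAMPLE_LOG))
for category, patch_ids in summary.items():
    print(category, ", ".join(patch_ids) or "-")
```

To use it on a real run, one would read =patch-list-20081215.lst= and a saved copy of the console output into the two parse functions instead of the sample strings. A patch in the "not_reported" group is not necessarily missing; the console output simply contains no result entry for it.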
Topic revision: r5 - 2008-12-15 - DerekFeichtinger
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.