MAY 2013
To get statistics on broken Solaris disks, run:
[root@t3nagios ~]# /usr/local/bin/disks_failure_statistics.sh
Disk problems on the Thumper/Thor Fileservers
Important external information
Best practice for disk replacement
2013-05-31 Replacing a disk without resilvering a spare first
Starting situation: The ILOM has issued a predictive failure warning for a disk based on SMART value detections.
- Check the Solaris Fault Manager. It tells us the number of the disk (28 in this example):
root@t3fs09 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
May 31 09:28:37 703aae75-6f49-638e-8ab5-eb08e580d005 DISK-8000-0X Major
Host : t3fs09
Platform : Sun Fire X4540 Chassis_id : 0949AMR064
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019cb5f77//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@4,0
faulted but still in service
FRU : "HD_ID_28" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR064:server-id=t3fs09:serial=9QJ5SE83:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=28/disk=0)
faulty
- Have a look at the SMART values and find the disk name (here: c4t4) using the hd tool:
root@t3fs09 $ hd -R
...
25 c4t1 0 150 55880057542 27 0 0 33 25510 0 27 909 909 28 13 39 0 0 0 0
26 c4t2 79081685 0 29 23 8238056347322 29147 0 29 0 0 4295032833 0 639172636 28 0 20 79081685 0 0 0
27 c4t3 0 90 25786319174 10 0 0 31 9392 0 10 27 27 28 16 35 0 0 0 0
28 c4t4 76058431 0 29 1383 43229726855 29148 0 29 0 1754 131118 0 571801623 23 0 19 76058431 0 0 0
...
- Find the device mapping that we need to use with the cfgadm command:
root@t3fs09 $ cfgadm -a | grep c4t4
c4::dsk/c4t4d0 disk connected configured unknown
- Offline the disk and unconfigure it, so that the blue disk LED helps you to locate it:
root@t3fs09 $ zpool offline data1 c4t4d0
root@t3fs09 $ cfgadm -c unconfigure c4::dsk/c4t4d0
- Replace the disk in the Thor/Thumper
- Make the disk active for the system by configuring it. Note: upon seeing the new disk, the system regrettably always activates a spare needlessly.
root@t3fs09 $ cfgadm -c configure c4::dsk/c4t4d0
root@t3fs09 $ zpool status -x
...
spare DEGRADED 0 0 108K
replacing DEGRADED 0 0 0
c4t4d0s0/o FAULTED 0 0 0 corrupted data
c4t4d0 ONLINE 0 0 0 1.88G resilvered
c6t5d0 ONLINE 0 0 0 3.74G resilvered
c5t1d0 ONLINE 0 0 0
...
spares
c6t5d0 INUSE currently in use
c6t6d0 AVAIL
c6t7d0 AVAIL
- Stop the needlessly activated spare from resilvering and detach it
root@t3fs09 $ zpool scrub -s data1
root@t3fs09 $ zpool detach data1 c6t5d0
root@t3fs09 $ zpool status -x
...
c3t7d0 ONLINE 0 0 0
replacing DEGRADED 0 0 0
c4t4d0s0/o FAULTED 0 0 0 corrupted data
c4t4d0 ONLINE 0 0 0 643M resilvered
c5t1d0 ONLINE 0 0 0
...
- Issue the replacement command to have the new disk resilvered
root@t3fs09 $ zpool replace data1 c4t4d0
- Enter the string used for the replacement (Here: zpool replace data1 c4t4d0) as a comment for the check_zfs_data1 test of the T3 Nagios. For this example the URL would be: https://t3nagios.psi.ch/nagios/cgi-bin/cmd.cgi?cmd_typ=34&host=t3fs09&service=check_zfs_data1
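The bullet steps above can be strung together. Here is a minimal, untested sketch of the whole procedure, assuming pool data1, faulted disk c4t4d0 on controller c4, and activated spare c6t5d0 as in this example (substitute the names reported by fmadm, hd and zpool status):
# Sketch only - better run the steps one by one and check zpool status in between.
POOL=data1; DISK=c4t4d0; AP=c4::dsk/c4t4d0
zpool offline $POOL $DISK          # take the suspect disk out of service
cfgadm -c unconfigure $AP          # blue LED now marks the bay
# ... physically swap the disk ...
cfgadm -c configure $AP            # system sees the new disk, needlessly activates a spare
zpool scrub -s $POOL               # stop the needless spare resilver
zpool detach $POOL c6t5d0          # detach the activated spare (name from zpool status)
zpool replace $POOL $DISK          # resilver onto the new disk
zpool status -x                    # monitor the resilver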
Typical example (old... from 2009)
Logwatch (q.v. CentralLogHost) shows the following for t3fs02:
Logfiles for Host: t3fs02
##################################################################
--------------------- Kernel module scsi Begin ------------------------
You may have R/W errors on your device 2 Time(s)
Requested Block: 203445440 Error Block: 203445440: 1 time(s)
WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):: 2 time(s)
ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0: 2 time(s)
Sense Key: Aborted_Command: 3 time(s)
Vendor: ATA Serial Number: : 2 time(s)
Requested Block: 203445696 Error Block: 203445696: 1 time(s)
---------------------- Kernel module scsi End -------------------------
Logging in to t3fs02 and running a ZFS status command
zpool status -x
pool: data1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
c9t4d0 ONLINE 0 0 0
...
c10t5d0 ONLINE 0 0 0
c9t1d0 ONLINE 0 0 1
c9t5d0 ONLINE 0 0 0
...
spares
c4t3d0 AVAIL
c4t7d0 AVAIL
errors: No known data errors
So, there is no damage yet, but one should keep an eye on that server. If the errors get more frequent, the disk should be replaced.
The disk with the checksum error in the status report is c9t1d0. But if I map the PCI name given in the log line, I end up with a different disk name
hd -w /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0
c4t4 = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0
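When several disks throw warnings, this mapping can be done in bulk. A hedged sketch, assuming the WARNING line format shown above and the hd tool being available:
# Extract the /pci... device paths from SCSI warnings and map each to a disk name.
grep 'scsi:.*WARNING: /pci' /var/adm/messages \
  | sed -n 's!.*WARNING: \(/pci[^ ]*\).*!\1!p' \
  | sort -u \
  | while read p; do hd -w "$p"; done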
Examples of failures
A particularly bad failure from T2_CH_CSCS on X4500/Solaris10 (happened twice within a few months, even though the backplane was exchanged):
Oct 20 09:55:15 se25.lcg.cscs.ch Command failed to complete...Device is gone
Oct 20 09:55:15 se25.lcg.cscs.ch scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd20):
Oct 20 09:55:15 se25.lcg.cscs.ch drive offline
List of occurrences in 2009:
t3fs02:
Sep 22 18:12:40 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):
Sep 22 18:12:40 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):
Oct 1 21:33:18 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@6,0 (sd49):
Oct 1 21:33:18 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@6,0 (sd49):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Note: t3fs02 currently actually runs OpenSolaris snv_86_rc3 X86, while the other servers run Solaris 10. It may well be that we only see warnings on this machine due to a difference in log reporting, especially since different disks and controllers are involved in the errors.
I cleared the error status for t3fs02 on 2010-01-07 since no more irrecoverable errors had appeared. Recoverable errors like the above seem to happen from time to time.
zpool clear data1
List of occurrences in 2010:
t3fs07 2010-07-15 - example of a successful disk exchange in a running X4540 system
Disk: c3t7d0
I needed to introduce a spare disk manually. Since this was immediately before my holidays, I did it quite fast and then regrettably forgot about the incident after my vacation.
---------------------SunFireX4540-------Rear----------------------------
3: 7: 11: 15: 19: 23: 27: 31: 35: 39: 43: 47:
c1t3 c1t7 c2t3 c2t7 c3t3 c3t7 c4t3 c4t7 c5t3 c5t7 c6t3 c6t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
2: 6: 10: 14: 18: 22: 26: 30: 34: 38: 42: 46:
c1t2 c1t6 c2t2 c2t6 c3t2 c3t6 c4t2 c4t6 c5t2 c5t6 c6t2 c6t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
1: 5: 9: 13: 17: 21: 25: 29: 33: 37: 41: 45:
c1t1 c1t5 c2t1 c2t5 c3t1 c3t5 c4t1 c4t5 c5t1 c5t5 c6t1 c6t5
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 4: 8: 12: 16: 20: 24: 28: 32: 36: 40: 44:
c1t0 c1t4 c2t0 c2t4 c3t0 c3t4 c4t0 c4t4 c5t0 c5t4 c6t0 c6t4
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*-----------*-SunFireX4540---*---Front----*---------*--------
cfgadm -al
Ap_Id Type Receptacle Occupant Condition
...
c3::dsk/c3t7d0 disk connected configured unknown
...
cfgadm -c unconfigure c3::dsk/c3t7d0
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 40.0M
c3t7d0 REMOVED 0 0 0
c6t7d0 ONLINE 0 0 0 725G resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
I replaced disk 23 (marked by a blue LED) in the running system. This can be done if the cover is removed for less than 60 seconds.
Bringing the disk online:
cfgadm -c configure c3::dsk/c3t7d0
zpool replace data1 c3t7d0
# resilvering
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress for 0h0m, 0.07% done, 12h12m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 40.0M
replacing DEGRADED 0 0 0
c3t7d0s0/o FAULTED 0 0 0 corrupted data
c3t7d0 ONLINE 0 0 0 441M resilvered
c6t7d0 ONLINE 0 0 0 20.5K resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
...
After a few hours of resilvering, the spare disk was automatically taken out of the configuration and the array was fixed.
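To keep an eye on a running resilver without sitting on the console, a simple polling loop suffices (sketch; the 5-minute interval is arbitrary):
# Print the resilver progress line every 5 minutes until it completes.
while zpool status data1 | grep 'resilver in progress'; do
    sleep 300
done
zpool status data1 | grep 'scrub:'   # final completion line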
t3fs09 2010-09-12 - example of a successful disk exchange on a X4540 (OS powered down)
Automatic failover has happened
zpool status data1
...
resilver completed after 7h4m with 0 errors on Sun Sep 12 13:13:22 2010
...
raidz2 DEGRADED 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 24.5M
c4t7d0 FAULTED 3 62 0 too many errors
c6t7d0 ONLINE 0 0 0 445G resilvered
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
...
In the ILOM log I find
745 Fri Sep 10 01:21:23 2010 IPMI Log critical
ID = ce : 09/10/2010 : 01:21:23 : Drive Slot : DBP/HDD31/STATE : Drive
Fault
The hd command suddenly blocked during the listing and was almost unkillable.
The internal Solaris 10 fault reporting showed:
root@t3fs09 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 10 03:21:58 69cc60ac-9f06-4e60-f7fa-da22d6374ed2 DISK-8000-0X Major
Host : t3fs09
Platform : Sun Fire X4540 Chassis_id : 0949AMR064
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c5002065fb8d//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@7,0
faulted but still in service
FRU : "HD_ID_31" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR064:server-id=t3fs09:serial=9QJ5Z2FT:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=31/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
I tried to unconfigure the disk
# the disk name to be used in the cfgadm command can be obtained with:
cfgadm -al
cfgadm -c unconfigure c4::dsk/c4t7d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
I then tried to take the disk offline first. The command succeeded, but the zpool status output still looked the same, and the unconfigure failed again.
zpool offline data1 c4t7d0
cfgadm -c unconfigure c4::dsk/c4t7d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
Logging in on the console, I got the following message, which repeats every minute or so:
13:23:53 t3fs09 scsi: WARNING: /pci@3c,0/pci10de,377@a/pci1000,1000@0 (mpt3):
# mapping the PCI address yields a disk on the same controller c4
hd -w /pci@3c,0/pci10de,377@a/pci1000,1000@0
c4t0 = /pci@3c,0/pci10de,377@a/pci1000,1000@0
These items may be relevant:
iostat shows these errors
iostat -En
...
c4t7d0 Soft Errors: 2 Hard Errors: 4 Transport Errors: 9731
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 1 Device Not Ready: 0 No Device: 3 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
...
c4t0d0 Soft Errors: 2 Hard Errors: 2 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
...
# all other disks typically show this (all show soft errors, but only a few have hard errors)
c5t3d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
I now exchanged the disk in the powered-down system (the OS was down, but I kept the machine under power and the management processor online).
The defective disk was marked by a blue LED.
Ok. Let's try to bring the disk online.
cfgadm -c configure c4::dsk/c4t7d0
# still listed as FAULTED in the zpool status
zpool clear data1 c4t7d0
# now listed as OFFLINE in the zpool status
zpool online data1 c4t7d0
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Oct 8 17:01:02 CEST 2010
PLATFORM: Sun Fire X4540, CSN: 0949AMR064 , HOSTNAME: t3fs09
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 4edd757e-3dfb-e504-a43c-f81be2b69de3
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
warning: device 'c4t7d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
So, the last two commands did not help much.
Ok. Let's use the replace command with the single-disk argument. This should announce to the system that there is a new disk in the slot:
zpool replace data1 c4t7d0
root@t3fs09 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.17% done, 7h10m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 79
replacing DEGRADED 0 0 0
c4t7d0s0/o FAULTED 0 0 0 corrupted data
c4t7d0 ONLINE 0 0 0 1.08G resilvered
c6t7d0 ONLINE 0 0 0 41.5K resilvered
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
...
OK. This did it. The new disk immediately started to be resilvered. After it had finished several hours later, the spare disk was automatically taken out of the raidz2 array again and put into standby state.
t3fs07 2010-09-30 warnings
Disk: c2t6d0
Entries in the central logs. Shows 16 read errors in the zpool status output.
Observation: The older broken disk and also this one show high counts in the SMART monitoring values for "Command Timeout Count". This seems significant.
No entry in the ILOM logs.
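The suspicious counter can also be read directly with smartctl instead of hd. A hedged sketch; the attribute is usually called Command_Timeout (ID 188) on these Seagate drives, which is an assumption here:
# Dump the SMART attribute table of the suspect disk and pick out the counters of interest.
smartctl -A /dev/rdsk/c2t6d0 | egrep -i 'command_timeout|reallocated'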
t3fs07 2010-10-18 disk failure
System is up, but zpool status just freezes. dCache seems to hang as well.
root@t3fs07 $ zpool status
pool: data1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed after 4h30m with 0 errors on Fri Oct 8 22:08:15 2010
... hangs ...
The resilver mentioned above refers to the last problem of this file server.
fmadm faulty hangs for quite some time before yielding:
Host : t3fs07
Platform : Sun Fire X4540 Chassis_id : 0949AMR020
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c3b9c2//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
faulted but still in service
FRU : "HD_ID_3" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
root@t3fs07 $ fmdump -v -u df4e42b9-447f-ea75-8b80-d7165084fd40
TIME UUID SUNW-MSG-ID
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
100% fault.io.disk.predictive-failure
Problem in: hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0
Affects: dev:///:devid=id1,sd@n5000c50019c3b9c2//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
FRU: hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0
Location: HD_ID_3
root@t3fs07 $ fmdump
TIME UUID SUNW-MSG-ID
Oct 08 17:38:06.5585 3f41b1d2-d666-c799-fa35-cb4dfa402077 ZFS-8000-D3
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
The syslog is full of these messages
grep t3fs07 messages | sed -e 's/.*scsi: *//' | sort | uniq -c
695 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
128 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
695 [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
1 Oct 18 13:41:00 t3fs07.psi.ch genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
In the ILOM log I find
769 Sun Oct 17 21:39:34 2010 IPMI Log critical
ID = 17e : 10/17/2010 : 21:39:34 : Drive Slot : DBP/HDD3/STATE : Drive F
ault
I decided to reboot the system. Regrettably the system did not shut down. Console output
Oct 18 13:52:46 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 18 13:52:46 t3fs07 Disconnected command timeout for Target 3
Oct 18 13:53:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 18 13:53:57 t3fs07 Disconnected command timeout for Target 3
I had to force the system down through the ILOM with stop -force /SP.
The system came up with no disk marked as faulty. Everything seemed all right. All commands worked... strange. I left it running in the hope that the next failure would trigger an automatic failover (which had worked great the previous times).
The server failed again during the night, with very similar symptoms:
- dCache does not deliver files any more
- zpool status -x reports that all pools are healthy!
- zpool status just hangs forever (cannot be killed... utterly lost in kernel space)
- fmadm faulty hangs for a long time before issuing the same error as before the reboot, above (Event-ID df4e42b9-447f-ea75-8b80-d7165084fd40)
- the messages log points to disk /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (see below)
central messages log
# grep t3fs07 messages | sed -e 's/.*scsi: *//' | sort | uniq -c
455 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
51 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
455 [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Trying to map the disk that is reported as problematic in the messages log identifies the same disk as indicated by the fmadm commands, c1t3.
hd -w /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
c1t3 = /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
Trying to power off the system:
poweroff
Oct 19 09:29:56 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:29:56 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:29:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:29:57 t3fs07 drive offline
Oct 19 09:29:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:29:57 t3fs07 i/o to invalid geometry
Oct 19 09:31:07 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:31:07 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:32:18 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:32:18 t3fs07 Disconnected command timeout for Target 3
I had to shut the system down forcefully through the ILOM with stop -force /SYS.
The system took some time to boot up again. After the reboot, zpool status worked correctly, and I decided to try a manual disk replace.
Trying to manually replace the faulted disk
zpool replace data1 c1t3d0 c6t7d0
# This seemed to start correctly, and I was able to monitor the beginning of the operation as usual
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.05% done, 14h6m to go
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
...
BUT then suddenly, the zpool status blocks again! And again, the commands keep hanging and cannot be terminated even by SIGKILL.
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h2m, 0.46% done, 8h45m to go
*** HANGS - CANNOT BE KILLED ***
# on the console I see
Oct 19 09:52:36 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:52:36 t3fs07 SCSI transport failed: reason 'reset': retrying command
Oct 19 09:53:18 t3fs07 scsi: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:53:18 t3fs07 mpt0: unknown event 13 received
Oct 19 09:54:20 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:54:20 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:55:31 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:55:31 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:56:42 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:56:42 t3fs07 Disconnected command timeout for Target 3
...
Oct 19 10:09:44 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 10:09:44 t3fs07 SCSI transport failed: reason 'reset': giving up
...
Oct 19 12:54:13 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 12:54:13 t3fs07 drive offline
...
Small correction: the zpool status command did terminate on the SIGKILL, after about 3 hours.
Opening the system at runtime shows a yellow LED on the defective disk c1t3d0 (slot 3).
On Wed. Oct 20 I received a replacement disk.
- issued a shutdown of the system
- had to shut the system down forcefully over the ILOM
- exchanged the disk (a yellow LED is still shown after inserting the new disk)
- started /SYS
- system again takes a long time in the initializing phase
Upon startup, after some time I get this on the console:
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Oct 20 14:08:25 CEST 2010
PLATFORM: Sun Fire X4540, CSN: 0949AMR020 , HOSTNAME: t3fs07
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 82b554c1-6136-6b20-ed16-f12378200985
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
spare DEGRADED 0 0 0
c1t3d0 FAULTED 0 0 0 too many errors
c6t7d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
...
spares
c6t7d0 INUSE currently in use
c6t6d0 AVAIL
c6t5d0 AVAIL
errors: No known data errors
cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 disk connected configured unknown
c1::dsk/c1t1d0 disk connected configured unknown
c1::dsk/c1t2d0 disk connected configured unknown
c1::dsk/c1t3d0 disk connected configured unknown
...
Making the system aware of the physical replacement of the disk
zpool replace data1 c1t3d0
root@t3fs07 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h1m, 0.15% done, 16h12m to go
config:
...
raidz2 DEGRADED 0 0 0
spare DEGRADED 0 0 48.4K
replacing DEGRADED 0 0 0
c1t3d0s0/o FAULTED 0 0 0 too many errors
c1t3d0 ONLINE 0 0 0 1.00G resilvered
c6t7d0 ONLINE 0 0 0 871M resilvered
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
...
After I had issued this command, it seems that both the spare disk and the new disk are now being resilvered (the system had failed to correctly bring in the spare disk yesterday, while the defective disk was still in place). I hope that this will not lead to further complications.
I was able to stop the unnecessary resilvering of the spare disk by taking it out of the raidset using zpool detach:
zpool detach data1 c6t7d0
root@t3fs07 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.11% done, 7h28m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
...
c6t2d0 ONLINE 0 0 0
raidz2 DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c1t3d0s0/o FAULTED 0 0 0 too many errors
c1t3d0 ONLINE 0 0 0 779M resilvered
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
...
spares
c6t7d0 AVAIL
c6t6d0 AVAIL
c6t5d0 AVAIL
errors: No known data errors
Next morning, the zpool status command shows that a complete resilvering has occurred and everything looks good.
fmadm faulty still shows the same error as above (from Oct 17). The newer failure events concerning this problem (Oct 19, 20) seem to have been correctly cleared, though.
root@t3fs07 $ fmdump
TIME UUID SUNW-MSG-ID
Oct 08 17:38:06.5585 3f41b1d2-d666-c799-fa35-cb4dfa402077 ZFS-8000-D3
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
Oct 19 20:44:20.6741 67164ab4-ab72-e1fc-c094-bd67f06d7db3 ZFS-8000-FD
Oct 20 14:08:26.1544 82b554c1-6136-6b20-ed16-f12378200985 ZFS-8000-D3
I manually made the fault manager aware of the repair of this old problem
root@t3fs07 $ fmadm repaired "dev:///:devid=id1,sd@n5000c50011234891//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0"
fmadm: recorded repair to dev:///:devid=id1,sd@n5000c50011234891//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
root@t3fs07 $ fmadm faulty
* no more output of failures *
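The UUID-to-FMRI lookup can be scripted. A hedged sketch, assuming the 'Affects:' line format shown in the fmdump output above:
# Mark an old fault repaired, given its UUID from fmdump.
uuid=df4e42b9-447f-ea75-8b80-d7165084fd40   # example UUID from above
fmri=`fmdump -v -u $uuid | sed -n 's!.*Affects: *\(dev:[^ ]*\).*!\1!p' | head -1`
fmadm repaired "$fmri"
fmadm faulty                                # verify the fault list is now empty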
t3fs11 2010-11-12 disk failure (no automatic failover, scsi timeouts)
- On 2010-11-12 around 17:30h the dCache services on t3fs11 became unresponsive
- SSH login no longer worked
- login through the SP console worked, but a "fmadm faulty" command immediately blocked
SP/logs:
1734 Wed Nov 10 03:28:09 2010 IPMI Log critical
ID = 391 : 11/10/2010 : 03:28:09 : Drive Slot : DBP/HDD40/STATE : Drive
Fault
1727 Wed Oct 20 09:19:47 2010 IPMI Log critical
ID = 38a : 10/20/2010 : 09:19:47 : Drive Slot : DBP/HDD33/STATE : Drive
Fault
THIS IS STRANGE. I was dead sure that I had checked all nodes with "zpool status / zpool status -x" after the last problems. I had not seen the older disk problem!
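To avoid overlooking such faults again, a periodic sweep over all fileservers from an admin host would help. A hedged sketch, assuming passwordless root SSH and the hostnames used on this page:
# Nightly health sweep over the Thor fileservers.
for h in t3fs02 t3fs07 t3fs09 t3fs10 t3fs11; do
    echo "=== $h ==="
    # NB: on a node with a sick mpt controller these commands may hang
    # (see the incidents above), so wrapping them in a timeout would be prudent.
    ssh root@$h 'zpool status -x; fmadm faulty'
done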
On console:
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:38:41 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:38:41 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:41:02 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0 (mpt4):
Nov 12 17:41:02 t3fs11 Disconnected command timeout for Target 1
Nov 12 17:41:03 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:41:03 t3fs11 SCSI transport failed: reason 'reset': giving up
I forcefully rebooted the system. It took a long time in the initialization phase.
After the OS was up, everything looked more or less ok, and a zpool status yielded
root@t3fs11 $ date
Fri Nov 12 17:56:42 CET 2010
root@t3fs11 $ zpool status -x
all pools are healthy
But an fmadm check reveals the problems!
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 10 02:28:42 f06347f4-ced1-ccc9-de60-bb0c727e17aa DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019d0b756//pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@0,0
faulted but still in service
FRU : "HD_ID_40" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5RC85:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=40/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 20 09:20:10 5eb5618e-eb59-e6e4-852d-9c035d56d620 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019bccc74//pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0
faulted but still in service
FRU : "HD_ID_33" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5P7HJ:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=33/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
Let's read out the SMART values manually:
root@t3fs11 $ hd -R
0 c1t0 71156486 0 29 3 4333649840 6802 0 29 0 0 0 0 589496355 35 0 21 71156486 0 0 0
1 c1t1 156722990 0 29 0 8629727596 6802 0 29 0 0 0 0 656801830 38 0 23 156722990 0 0 0
2 c1t2 82792345 0 29 1 12923635461 6802 0 29 0 0 0 0 673710120 40 0 22 82792345 0 0 0
3 c1t3 131980351 0 28 1 38787307 6821 0 28 0 0 0 0 741015596 44 0 23 131980351 0 0 0
4 c1t4 2105829 0 29 0 39140233 6802 0 29 0 0 0 0 589496355 35 0 21 2105829 0 0 0
5 c1t5 238257017 0 29 0 8629173752 6802 0 29 0 1 0 0 623181861 37 0 21 238257017 0 0 0
6 c1t6 113587266 0 29 0 39442020 6802 0 29 0 0 0 0 640024614 38 0 21 113587266 0 0 0
7 c1t7 4408992 0 28 0 40359853 6821 0 28 0 0 0 0 707264553 41 0 22 4408992 0 0 0
8 c2t0 162455847 0 30 2 4336062036 6802 0 30 0 0 0 0 589430819 35 0 21 162455847 0 0 0
9 c2t1 145513360 0 30 0 4336311849 6802 0 30 0 0 0 0 623181861 37 0 22 145513360 0 0 0
10 c2t2 226809752 0 30 16 41208154 6802 0 30 0 0 0 0 656801831 39 0 21 226809752 0 0 0
11 c2t3 174125163 0 29 0 4338248677 6821 0 29 0 0 0 0 656867367 39 0 21 174125163 0 0 0
12 c2t4 219429979 0 30 1 40478020 6801 0 30 0 0 0 0 589496355 35 0 21 219429979 0 0 0
13 c2t5 197134363 0 30 0 4335708953 6802 0 30 0 0 0 0 623116325 37 0 21 197134363 0 0 0
14 c2t6 173069853 0 30 1 12925996326 6802 0 30 0 0 0 0 640024614 38 0 21 173069853 0 0 0
15 c2t7 170349382 0 30 1 8633168282 6802 0 30 0 0 0 0 690487337 41 0 22 170349382 0 0 0
16 c3t0 231005259 0 29 0 40836922 6802 0 29 0 0 0 0 572653602 34 0 21 231005259 0 0 0
17 c3t1 185048562 0 29 1 25810826414 6802 0 29 0 0 0 0 640024614 38 0 23 185048562 0 0 0
18 c3t2 153398438 0 30 3 41376728 6802 0 30 0 0 0 0 673644584 40 0 23 153398438 0 0 0
19 c3t3 184747717 0 29 0 43791532 6852 0 29 0 0 0 0 673644584 40 0 21 184747717 0 0 0
20 c3t4 129781851 0 30 10 40624346 6802 0 30 0 0 0 0 606273572 36 0 22 129781851 0 0 0
21 c3t5 90664177 0 30 0 8630917997 6802 0 30 0 0 0 0 623116325 37 0 21 90664177 0 0 0
22 c3t6 149266517 0 30 2 506843827975 6802 0 30 0 0 0 0 656801831 39 0 21 149266517 0 0 0
23 c3t7 87340628 0 29 1 8633518363 6822 0 29 0 0 0 0 656801831 39 0 21 87340628 0 0 0
24 c4t0 189545523 0 30 0 4335705428 6802 0 30 0 0 0 0 623116325 37 0 23 189545523 0 0 0
25 c4t1 170974309 0 30 0 41051470 6802 0 30 0 0 0 0 639959078 38 0 21 170974309 0 0 0
26 c4t2 206835020 0 30 1 8630880522 6802 0 30 0 0 0 0 656801831 39 0 22 206835020 0 0 0
27 c4t3 56131886 0 30 8 43358185 6802 0 30 0 0 0 0 673644584 40 0 22 56131886 0 0 0
28 c4t4 47595425 0 30 2 40986081 6802 0 30 0 0 0 0 606273572 36 0 21 47595425 0 0 0
29 c4t5 111191905 0 30 0 41100684 6802 0 30 0 0 0 0 639959078 38 0 23 111191905 0 0 0
30 c4t6 118020682 0 30 0 4335644549 6802 0 30 0 0 0 0 623181861 37 0 20 118020682 0 0 0
31 c4t7 48975824 0 30 0 4337809471 6802 0 30 0 0 0 0 690487337 41 0 23 48975824 0 0 0
32 c5t0 164236644 0 30 0 40837732 6802 0 30 0 0 0 0 623116325 37 0 22 164236644 0 0 0
33 c5t1 164149755 0 30 2047 39861850 6802 0 30 0 20 0 0 639959078 38 0 21 164149755 0 0 0
34 c5t2 222857622 0 30 0 41069638 6802 0 30 0 0 0 0 656801831 39 0 21 222857622 0 0 0
35 c5t3 26578537 0 30 63 12928258145 6802 0 30 0 0 0 0 690487337 41 0 22 26578537 0 0 0
36 c5t4 47674425 0 30 0 4335792581 6801 0 30 0 0 0 0 606273572 36 0 22 47674425 0 0 0
37 c5t5 49987836 0 30 0 4335657030 6802 0 30 0 0 0 0 673644584 40 0 24 49987836 0 0 0
38 c5t6 131060750 0 29 0 41090773 6848 0 29 0 0 0 0 656801831 39 0 21 131060750 0 0 0
39 c5t7 259680 0 30 4 12928404237 6802 0 30 0 0 0 0 673644584 40 0 22 259680 0 0 0
40 c6t0 33501316 0 30 1891 8631534340 6801 0 30 0 9 0 0 589430819 35 0 21 33501316 1 1 0
41 c6t1 88093475 0 30 0 4336384462 6802 0 30 0 0 0 0 639959078 38 0 22 88093475 0 0 0
42 c6t2 12197846 0 30 2 8631181874 6802 0 30 0 0 0 0 673644584 40 0 23 12197846 0 0 0
43 c6t3 120037422 0 29 0 43593096 6821 0 29 0 0 0 0 724107307 43 0 23 120037422 0 0 0
44 c6t4 173604553 0 30 0 40575636 6801 0 30 0 0 0 0 606273572 36 0 22 173604553 0 0 0
45 c6t5 122179284 0 29 0 17139749 6854 0 29 0 0 0 0 673579048 40 0 23 122179284 0 0 0
46 c6t6 203753234 0 30 4 30082236234 6802 0 30 0 0 0 0 673579048 40 0 22 203753234 0 0 0
47 c6t7 150210603 0 28 0 4313753034 6847 0 29 0 0 0 0 690487337 41 0 21 150210603 0 0 0
The unnaturally high values in the 6th column for both of these disks refer to the Reallocated sector count SMART value. Also, the Uncorrectable Errors for Host values are greater than zero (20 and 9) for these two disks.
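Picking those disks out of the hd -R listing can be automated. A hedged awk filter, assuming the column layout shown above, where the reallocated sector count is the 6th whitespace-separated field; the threshold of 100 is arbitrary:
# Flag disks with an unnaturally high reallocated sector count.
hd -R | awk '$6 > 100 { print "slot " $1 ": " $2 " reallocated sectors: " $6 }'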
Mappings:
| HD33 | c5t1 | c5::dsk/c5t1d0 |
| HD40 | c6t0 | c6::dsk/c6t0d0 |
The pool setup for this zpool:
root@t3fs11 $ zpool history
History for 'data1':
2010-08-17.15:44:09 zpool create -f data1 raidz2 c1t0d0 c1t5d0 c2t2d0 c2t7d0 c3t4d0 c4t1d0 c4t6d0 c5t3d0 c6t0d0
2010-08-17.15:44:15 zpool add -f data1 raidz2 c1t1d0 c1t6d0 c2t3d0 c3t0d0 c3t5d0 c4t2d0 c4t7d0 c5t4d0 c6t1d0
2010-08-17.15:44:20 zpool add -f data1 raidz2 c1t2d0 c1t7d0 c2t4d0 c3t1d0 c3t6d0 c4t3d0 c5t0d0 c5t5d0 c6t2d0
2010-08-17.15:44:25 zpool add -f data1 raidz2 c1t3d0 c2t0d0 c2t5d0 c3t2d0 c3t7d0 c4t4d0 c5t1d0 c5t6d0 c6t3d0
2010-08-17.15:44:30 zpool add -f data1 raidz2 c1t4d0 c2t1d0 c2t6d0 c3t3d0 c4t0d0 c4t5d0 c5t2d0 c5t7d0 c6t4d0
2010-08-17.15:44:33 zpool add -f data1 spare c6t7d0 c6t6d0 c6t5d0
Since I am expecting potential problems with the SCSI communications (as in the previous problems, above), I want to remove these disks as completely as possible from the active system.
root@t3fs11 $ zpool offline data1 c5t1d0
root@t3fs11 $ zpool offline data1 c6t0d0
root@t3fs11 $ cfgadm -c unconfigure c5::dsk/c5t1d0
root@t3fs11 $ cfgadm -c unconfigure c6::dsk/c6t0d0
root@t3fs11 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 OFFLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 OFFLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
...
Ok... replacing the first broken disk with one of the spare disks
root@t3fs11 $ zpool replace data1 c5t1d0 c6t7d0
The resilver started well as observed by zpool status... I can see no SCSI errors on the console over several minutes.
Getting a bit more daring... I try to resilver the second disk in parallel. The disks are in different RAID sets, so this should not hurt too much.
root@t3fs11 $ zpool replace data1 c6t0d0 c6t6d0
Seems to work.... about 8 hours to go. I will not run dcache on these nodes during that time... let's not push our luck.
The resilvering operation terminated successfully after 11 hours.
I exchanged both disks physically on 2010-11-17 15:30. Both were correctly marked by blue LEDs.
root@t3fs11 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver completed after 11h19m with 0 errors on Sat Nov 13 06:01:03 2010
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.8M
c6t0d0 UNAVAIL 0 0 0 cannot open
c6t6d0 ONLINE 0 0 0 577G resilvered
raidz2 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t0d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.5M
c5t1d0 UNAVAIL 0 0 0 cannot open
c6t7d0 ONLINE 0 0 0 578G resilvered
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t7d0 INUSE currently in use
c6t6d0 INUSE currently in use
c6t5d0 AVAIL
errors: No known data errors
root@t3fs11 $ cfgadm -c configure c5::dsk/c5t1d0
root@t3fs11 $ zpool online data1 c5t1d0
warning: device 'c5t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
Still, the disk remains in unavailable state when querying zpool status. Let's follow the message and issue a replace
root@t3fs11 $ zpool replace data1 c5t1d0
root@t3fs11 $ zpool status -x
...
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.5M
replacing DEGRADED 0 0 0
c5t1d0s0/o FAULTED 0 0 0 corrupted data
c5t1d0 ONLINE 0 0 0 526M resilvered
c6t7d0 ONLINE 0 0 0 8K resilvered
c5t6d0 ONLINE 0 0 0
...
Same procedure for the second disk:
root@t3fs11 $ cfgadm -c configure c6::dsk/c6t0d0
root@t3fs11 $ zpool online data1 c6t0d0
t3fs10 2010-11-22 drive failure (no automatic failover, scsi timeouts)
The zpool command blocks
zpool status -x
Fault manager
fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 22 08:00:11 4913b692-b29e-682a-f09a-9491300fb237 DISK-8000-0X Major
Host : t3fs10
Platform : Sun Fire X4540 Chassis_id : 0949AMR021
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c19f38//pci@0,0/pci10de,376@f/pci1000,1000@0/sd@7,0
faulted but still in service
FRU : "HD_ID_23" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR021:server-id=t3fs10:serial=9QJ5QJ4T:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=23/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
705 Mon Nov 22 08:58:07 2010 IPMI Log critical
ID = 17b : 11/22/2010 : 08:58:07 : Drive Slot : DBP/HDD23/STATE : Drive
Fault
In the central messages log:
Nov 22 08:00:11 t3fs10.psi.ch fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-0X, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 22 08:00:11 t3fs10.psi.ch EVENT-TIME: Mon Nov 22 08:00:11 CET 2010
Nov 22 08:00:11 t3fs10.psi.ch PLATFORM: Sun Fire X4540, CSN: 0949AMR021 , HOSTNAME: t3fs10
Nov 22 08:00:11 t3fs10.psi.ch SOURCE: eft, REV: 1.16
Nov 22 08:00:11 t3fs10.psi.ch EVENT-ID: 4913b692-b29e-682a-f09a-9491300fb237
On the console:
root@t3fs10 $ Nov 22 15:23:14 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Nov 22 15:23:14 t3fs10 Disconnected command timeout for Target 7
Nov 22 15:25:05 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Nov 22 15:25:05 t3fs10 Disconnected command timeout for Target 7
Nov 22 15:26:16 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Did a forced shutdown and reboot. Afterwards the zpool command no longer blocked and I was able to get more diagnostics again:
root@t3fs10 $ hd -R
0 c1t0 125535798 0 19 0 12960759222 7049 0 19 0 0 0 1 454754331 27 0 20 125535798 0 0 0
1 c1t1 237015775 0 19 0 47328169527 7050 0 19 0 0 0 0 505217054 29 0 20 237015775 0 0 0
2 c1t2 182514992 0 19 1 4378448541 7050 0 19 0 0 0 0 522059807 31 0 19 182514992 0 0 0
3 c1t3 164619320 0 19 0 107455757220 7049 0 19 0 0 0 0 538968096 32 0 20 164619320 0 0 0
4 c1t4 164321560 0 19 0 141818656169 7050 0 19 0 0 0 0 454688795 27 0 20 164321560 0 0 0
5 c1t5 219147005 0 19 0 60206395946 7050 0 19 0 0 0 0 488439837 29 0 20 219147005 0 0 0
6 c1t6 83774341 0 19 3 8674093422 7050 0 19 0 0 0 0 505217054 30 0 19 83774341 0 0 0
7 c1t7 82812396 0 19 0 74561503 7049 0 19 0 0 0 0 522125343 31 0 19 82812396 0 0 0
8 c2t0 18252396 0 19 0 21556449891 7050 0 19 0 0 0 0 437846042 26 0 20 18252396 0 0 0
9 c2t1 59917419 0 19 1 85251527 7049 0 19 0 0 0 0 488374301 29 0 20 59917419 0 0 0
10 c2t2 106273402 0 19 0 34435821549 7049 0 19 0 0 0 0 488439837 29 0 19 106273402 0 0 0
11 c2t3 58113178 0 19 0 4379533373 7049 0 19 0 0 0 0 555810849 33 0 21 58113178 0 0 0
12 c2t4 111165675 0 19 0 189062989066 7049 0 19 0 0 0 0 437846042 26 0 20 111165675 0 0 0
13 c2t5 146995884 0 19 0 8671816057 7049 0 19 0 0 0 0 454754331 27 0 20 146995884 0 0 0
14 c2t6 191086390 0 19 0 4380041942 7049 0 19 0 0 0 0 505217053 29 0 20 191086390 0 0 0
15 c2t7 162089547 0 19 0 17255685134 7050 0 19 0 0 0 0 538968096 32 0 21 162089547 0 0 0
16 c3t0 176979997 0 19 0 34443372067 7049 0 19 0 0 0 0 421003289 24 0 20 176979997 0 0 0
17 c3t1 5548231 0 19 0 84007444 7049 0 19 0 0 0 0 454754331 27 0 20 5548231 0 0 0
18 c3t2 173448060 0 19 0 8671492950 7049 0 19 0 0 0 0 471597084 28 0 20 173448060 0 0 0
19 c3t3 4924500 0 19 0 773178340077 7049 0 19 0 0 0 0 488439837 29 0 19 4924500 0 0 0
20 c3t4 44076624 0 19 0 21551476816 7049 0 19 0 0 0 0 421003289 24 0 19 44076624 0 0 0
21 c3t5 225718031 0 19 0 83790115 7049 0 19 0 0 0 0 437911578 26 0 19 225718031 0 0 0
22 c3t6 175558069 0 19 0 38738911150 7049 0 19 0 0 0 0 488439837 29 0 20 175558069 0 0 0
23 c3t7 26969752 0 19 2005 81773334 7049 0 19 0 885 5 0 505282590 30 0 20 26969752 42 42 0
24 c4t0 9005087 0 19 0 12969472234 7049 0 19 0 0 0 0 421003289 25 0 20 9005087 0 0 0
25 c4t1 16782365 0 18 0 8666295329 7071 0 18 0 0 0 0 454688795 27 0 20 16782365 0 0 0
26 c4t2 24234100 0 19 0 47328536916 7049 0 19 0 0 0 0 488439837 29 0 20 24234100 0 0 0
27 c4t3 62181622 0 19 0 154703494002 7049 0 19 0 0 0 0 505282590 30 0 20 62181622 0 0 0
28 c4t4 91302033 0 19 0 21556859439 7049 0 19 0 0 0 0 421003289 25 0 20 91302033 0 0 0
29 c4t5 47181631 0 19 0 17265235415 7049 0 19 0 0 0 0 437911578 26 0 19 47181631 0 0 0
30 c4t6 196503721 0 19 2 4370861630 7049 0 19 0 0 0 0 471597084 28 0 20 196503721 0 0 0
31 c4t7 8413596 0 19 5 4378336008 7049 0 19 0 0 0 0 505282590 30 0 20 8413596 0 0 0
32 c5t0 189146743 0 18 3 12969013889 7049 0 18 0 0 0 0 404226072 24 0 20 189146743 0 0 0
33 c5t1 180453604 0 18 0 81297831 7049 0 18 0 0 0 0 454688795 27 0 19 180453604 0 0 0
34 c5t2 129281214 0 18 0 4379814706 7049 0 18 0 0 0 0 471597084 28 0 19 129281214 0 0 0
35 c5t3 45659498 0 18 1 4371302900 7049 0 18 0 0 0 0 488374301 29 0 19 45659498 0 0 0
36 c5t4 122992578 0 18 0 4378594187 7049 0 18 0 0 0 0 421003289 25 0 20 122992578 0 0 0
37 c5t5 201472395 0 18 1 38738707175 7049 0 18 0 0 0 0 454754331 27 0 20 201472395 0 0 0
38 c5t6 138606984 0 18 0 8671822466 7049 0 18 0 0 0 0 488439837 29 0 20 138606984 0 0 0
39 c5t7 116268514 0 18 0 8674426890 7049 0 18 0 0 0 0 505282590 30 0 20 116268514 0 0 0
40 c6t0 55061066 0 18 0 60204989916 7049 0 18 0 0 0 0 421068825 25 0 20 55061066 0 0 0
41 c6t1 10142376 0 18 1 429580868094 7049 0 18 0 0 0 0 454754331 27 0 20 10142376 0 0 0
42 c6t2 81596927 0 18 0 12969434046 7049 0 18 0 0 0 0 471531547 27 0 19 81596927 0 0 0
43 c6t3 35092129 0 18 1 4376786413 7049 0 18 0 0 0 0 488374301 29 0 20 35092129 0 0 0
44 c6t4 57483251 0 18 1 30150093388 7049 0 18 0 0 0 0 421003289 25 0 20 57483251 0 0 0
45 c6t5 170939821 0 18 918 34375768257 7049 0 18 0 0 0 0 471597084 28 0 21 170939821 3 3 0
46 c6t6 168553543 0 18 0 15841561 7049 0 18 0 0 0 0 471531548 28 0 19 168553543 0 0 0
47 c6t7 171788766 0 18 0 4310892064 7049 0 18 0 0 0 0 505282590 30 0 20 171788766 0 0 0
Mappings:
| HD23 | c3t7 | c3::dsk/c3t7d0 |
zpool offline data1 c3t7d0
cfgadm -c unconfigure c3::dsk/c3t7d0
zpool replace data1 c3t7d0 c6t7d0
The spare disk seems to be resilvering correctly. I will start file services and dCache again.
root@t3fs10 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: resilver in progress for 0h4m, 0.55% done, 12h2m to go
config:
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 226K
c3t7d0 OFFLINE 0 0 0
c6t7d0 ONLINE 0 0 0 3.98G resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
....
# Later
scrub: resilver completed after 12h39m with 0 errors on Tue Nov 23 04:41:01 2010
List of occurrences in 2011:
2011-01-10 t3fs10 drive failure (of an unused spare disk)
The node was marked by a yellow service LED, the fmadm fault manager, and the ILOM log, but zpool status -x shows that all pools are healthy. The fault is flagged as an imminent failure due to SMART monitoring.
root@t3fs10 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Dec 14 09:34:24 b6f7ad51-5cfc-4f2b-bab0-a57d939eadde DISK-8000-0X Major
Host : t3fs10
Platform : Sun Fire X4540 Chassis_id : 0949AMR021
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c55a36//pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
faulted but still in service
FRU : "HD_ID_45" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR021:server-id=t3fs10:serial=9QJ5R4WX:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=45/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
748 Mon Jan 10 11:28:48 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
747 Tue Dec 14 09:34:11 2010 IPMI Log critical
ID = 195 : 12/14/2010 : 09:34:11 : Drive Slot : DBP/HDD45/STATE : Drive
Fault
746 Wed Dec 1 15:49:28 2010 IPMI Log critical
ID = 194 : 12/01/2010 : 15:49:28 : Drive Slot : DBP/HDD23/STATE : Drive
Fault (PREVIOUS FAULT... ALREADY FIXED)
SMART values show a high Reallocated sector count:
hd -R
...
42 c6t2 79087142 0 18 0 12978519190 8221 0 18 0 0 0 0 538640414 30 0 19 79087142 0 0 0
43 c6t3 81712106 0 18 1 4386173981 8221 0 18 0 0 0 0 555483167 31 0 20 81712106 0 0 0
44 c6t4 106459455 0 18 1 34454857772 8221 0 18 0 0 0 0 471334939 27 0 20 106459455 0 0 0
45 c6t5 171469164 0 18 1739 34378481112 8221 0 18 0 0 0 0 538705951 31 0 21 171469164 4 4 0
46 c6t6 169113078 0 18 0 18445266 8221 0 18 0 0 0 0 538640414 30 0 19 169113078 0 0 0
47 c6t7 126977783 0 18 0 4318805528 8221 0 18 0 0 0 0 572391456 33 0 20 126977783 0 0 0
Mapping to the disk name:
root@t3fs10 $ hd -w /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
c6t5 = /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
Oddly enough, this disk is actually a spare disk of the configuration. I do not understand how the reallocated sector count problem was able to arise there.
root@t3fs10 $ zpool offline data1 c6t5d0
cannot offline c6t5d0: device is reserved as a hot spare
root@t3fs10 $ cfgadm -c unconfigure c6::dsk/c6t5d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
root@t3fs10 $ zpool detach data1 c6t5d0
cannot detach c6t5d0: device is reserved as a hot spare
root@t3fs10 $ zpool remove data1 c6t5d0 # THAT ONE WORKS!!!!!
cfgadm -c unconfigure c6::dsk/c6t5d0
Replaced the disk, then brought it back into the configuration:
cfgadm -c configure c6::dsk/c6t5d0
zpool add -f data1 spare c6t5d0
fmadm faulty shows no problems any more.
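For reference, the spare-disk case condensed into one hedged sketch (disk c6t5d0 as in this example): offline and detach refuse to act on an unused hot spare, so the disk has to be removed from the pool instead.
# Replacing an unused hot spare in pool data1.
zpool remove data1 c6t5d0          # offline/detach fail for spares; remove works
cfgadm -c unconfigure c6::dsk/c6t5d0
# ... physically swap the disk ...
cfgadm -c configure c6::dsk/c6t5d0
zpool add -f data1 spare c6t5d0
fmadm faulty                       # should no longer list the fault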
2011-01-25 t3fs11 drive failure (no automatic failover, scsi timeouts)
zpool status hangs. fmadm faulty takes a long time (minutes) to return.
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 25 14:58:34 145d27a2-93c2-e3d0-8719-c063b734b1a9 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019ce2eb9//pci@0,0/pci10de,375@b/pci1000,1000@0/sd@6,0
faulted but still in service
FRU : "HD_ID_14" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5T450:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=14/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
SP logs. Note that the two older failures have already been repaired (see above). But it shows that there were no hints of anything like temperature problems between those failures and now.
ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------
1792 Wed Jan 26 14:11:36 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
1791 Tue Jan 25 14:57:58 2011 IPMI Log critical
ID = 3b4 : 01/25/2011 : 14:57:58 : Drive Slot : DBP/HDD14/STATE : Drive
Fault
1790 Thu Nov 18 17:51:08 2010 IPMI Log critical
ID = 3b3 : 11/18/2010 : 17:51:08 : Drive Slot : DBP/HDD40/STATE : Drive
Fault
1789 Thu Nov 18 17:51:07 2010 IPMI Log critical
ID = 3b2 : 11/18/2010 : 17:51:07 : Drive Slot : DBP/HDD33/STATE : Drive
Fault
On the system console
Jan 26 14:29:29 t3fs11 Disconnected command timeout for Target 6
Jan 26 14:29:30 t3fs11 scsi: WARNING: /pci@0,0/pci10de,375@b/pci1000,1000@0/sd@6,0 (sd22):
Jan 26 14:29:30 t3fs11 SCSI transport failed: reason 'reset': giving up
I had to shut down the system forcefully through the ILOM.
The system came up again in an apparently healthy state:
root@t3fs11 $ zpool status -x
all pools are healthy
Manually checking the SMART table reveals a high Reallocated sector count for this disk as well:
hd -R
...
11 c2t3 113880158 0 30 0 4355539612 8618 0 30 0 0 0 0 538968096 32 0 21 113880158 0 0 0
12 c2t4 47366463 0 31 1 111728827925 8598 0 31 0 0 0 0 454688795 27 0 21 47366463 0 0 0
13 c2t5 72532704 0 31 0 8647554845 8598 0 31 0 0 0 0 488374301 28 0 21 72532704 0 0 0
14 c2t6 214809788 0 31 2047 17239657814 8598 0 31 0 5028 17180131357 0 522059807 31 0 21 214809788 0 0 0
15 c2t7 46182055 0 31 1 8650188048 8598 0 31 0 0 0 0 555745313 33 0 22 46182055 0 0 0
16 c3t0 235541355 0 30 1 4353410746 8598 0 30 0 0 0 0 437846042 26 0 21 235541355 0 0 0
...
Initiating a manual replacement of the disk in the zpool configuration:
root@t3fs11 $ zpool replace data1 c2t6d0 c6t7d0
root@t3fs11 $ zpool status -x
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.11% done, 8h25m to go
config:
NAME STATE READ WRITE CKSUM
...
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
spare ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0 523M resilvered
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t7d0 INUSE currently in use
c6t6d0 AVAIL
c6t5d0 AVAIL
Turned dCache on again.
A few minutes later the system blocked again. Again a forced shutdown, and it came up fine. It seems that the OS still tries to communicate with the broken disk, even though it is being replaced!
I tried to offline and unconfigure the disk to prevent any communication. I also kept the machine out of dCache, so that no I/O at all, except for the resilvering, would take place.
root@t3fs11 $ zpool offline data1 c2t6d0
root@t3fs11 $ cfgadm -c unconfigure c2::dsk/c2t6d0
This worked, albeit it took 2h more than the projected time: 8.5 hours, lasting into the night. The resilver looks ok.
I started dCache again the following morning, and the system seems to run stably.
cfgadm -c configure c2::dsk/c2t6d0
zpool replace data1 c2t6d0
2011-02-22 t3fs11 imminent drive failure (no automatic failover yet)
ILOM recorded an imminent drive failure and sent mail.
Gather information:
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 22 14:49:40 7a731175-c212-4def-d3f7-de73f2db6441 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019e13105//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@3,0
faulted but still in service
FRU : "HD_ID_27" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u <EVENT-ID> to identify the disk.
root@t3fs11 $ fmdump -v -u 7a731175-c212-4def-d3f7-de73f2db6441
TIME UUID SUNW-MSG-ID
Feb 22 14:49:41.1532 7a731175-c212-4def-d3f7-de73f2db6441 DISK-8000-0X
100% fault.io.disk.predictive-failure
Problem in: hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0
Affects: dev:///:devid=id1,sd@n5000c50019e13105//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@3,0
FRU: hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0
Location: HD_ID_27
From the hd utility we can see that slot 27 maps to c4t3.
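The mapping can also be cross-checked via the disk serial from the FRU string (9QJ5XGFF), assuming the SUNWhd hd tool supports the -c -s inventory flags used further down this page:
root@t3fs11 $ hd -c -s | grep 9QJ5XGFF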
What does ZFS know?
root@t3fs11 $ zpool status -x
all pools are healthy
root@t3fs11 $ zpool status data1
# this command still works perfectly. ZFS is not yet aware that something is wrong.
Let's investigate the SMART tables:
root@t3fs11 $ hd -R
...
23 c3t7 90131225 0 31 2 8655799187 9269 0 31 0 0 0 0 555548704 32 0 21 90131225 0 0 0
24 c4t0 84145097 0 32 2 8654720307 9249 0 32 0 0 0 0 505020445 29 0 23 84145097 0 0 0
25 c4t1 134640711 0 32 0 63177304 9249 0 32 0 0 0 0 521797662 30 0 21 134640711 0 0 0
26 c4t2 143683009 0 32 1 8653043032 9249 0 32 0 0 0 0 555483168 32 0 22 143683009 0 0 0
27 c4t3 172966231 0 32 2043 67914384 9249 0 32 0 130 0 0 572391457 33 0 22 172966231 4 4 0
28 c4t4 2460622 0 32 3 63332723 9249 0 32 0 0 0 0 488112155 27 0 21 2460622 0 0 0
29 c4t5 155198022 0 32 0 65310109 9249 0 32 0 0 0 0 521863198 30 0 23 155198022 0 0 0
...
Information from smartmontools:
root@t3fs11 $ smartctl -a /dev/rdsk/c4t3d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094555XGFF
Serial Number: 9QJ5XGFF
Firmware Version: SU0E
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Feb 23 14:07:04 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
5 Reallocated_Sector_Ct 0x0033 001 001 036 Pre-fail Always FAILING_NOW 2043
...
...
OK. We want to take this disk out of the active pool and immediately substitute it with one of the spares.
We need to get the name mapping for the unconfigure command:
root@t3fs11 $ cfgadm -a
Ap_Id Type Receptacle Occupant Condition
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 disk connected configured unknown
...
c4::dsk/c4t3d0 disk connected configured unknown
...
We offline the disk and prevent the system from further interacting with it:
root@t3fs11 $ zpool offline data1 c4t3d0
root@t3fs11 $ cfgadm -c unconfigure c4::dsk/c4t3d0
We initiate the resilvering onto a spare:
root@t3fs11 $ zpool replace data1 c4t3d0 c6t6d0
2011-03-12 t3fs07 disk c2t0d0 failure
ILOM recorded a drive failure and sent an e-mail.
ID = 1ce : 03/12/2011 : 02:29:05 : Drive Slot : DBP/HDD8/STATE : Drive Fault
smartd also reported:
Device: /dev/rdsk/c2t0d0, FAILED SMART self-check. BACK UP DATA NOW!
Information from smartmontools:
root@t3fs07 $ smartctl -a /dev/rdsk/c2t0d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094455T8NT
Serial Number: 9QJ5T8NT
Firmware Version: SU0E
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Mar 13 11:15:35 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
...
What does ZFS know?
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scrub: resilver completed after 9h42m with 0 errors on Sat Mar 12 11:04:51 2011
...
spare DEGRADED 0 0 28.4M
c2t0d0 REMOVED 0 0 0
c6t5d0 ONLINE 0 0 0 515G resilvered
Remove the disk from ZFS:
root@t3fs07 $ zpool offline data1 c2t0d0
It produces a transition from REMOVED => OFFLINE:
root@t3fs07 $ zpool status -x|grep c2t0d0
c2t0d0 OFFLINE 0 0 0
Remove the disk from Solaris:
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::dsk/c2t0d0 disk connected configured unknown
root@t3fs07 $ cfgadm -c unconfigure c2::dsk/c2t0d0
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::rdsk/c2t0d0 disk connected unconfigured unknown
We replaced the disk with a spare one; after the physical change we got a state transition to 'configured':
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::dsk/c2t0d0 disk connected configured unknown
and in ZFS a transition OFFLINE => REMOVED:
root@t3fs07 $ zpool status | grep c2t0d0
c2t0d0 REMOVED 0 0 0
Replace the hot spare with the new disk:
root@t3fs07 $ zpool replace data1 c2t0d0
root@t3fs07 $ zpool status
...
scrub: resilver in progress for 0h0m, 0.16% done, 10h5m to go
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
spare DEGRADED 0 0 28.4M
replacing DEGRADED 0 0 0
c2t0d0s0/o FAULTED 0 0 0 corrupted data
c2t0d0 ONLINE 0 0 0 984M resilvered
c6t5d0 ONLINE 0 0 0 94K resilvered
...
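Once the resilver completes, the hot spare c6t5d0 becomes redundant and can be detached so that it returns to the AVAIL spare list (the same detach procedure as used below):
root@t3fs07 $ zpool detach data1 c6t5d0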
We opened case SR 3-3171818131 with Oracle about the broken disk SN 9QJ5T8NT
(ST31000340NS) on system SUN FIRE X4540 SN 0949AMR020; it was replaced with disk SN 9QJ5KV96
(ST31000340NS).
2011-03-13 t3fs07 disk c2t6d0 proactive maintenance
Disk c2t6d0 was found to have plenty of errors, though it had not failed yet; ZFS replaced it with the spare disk c6t6d0:
root@t3fs07 $ smartctl -a /dev/rdsk/c2t6d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094455V7M5
Serial Number: 9QJ5V7M5
...
root@t3fs07 $ zpool status
spare DEGRADED 0 0 620K
c2t6d0 FAULTED 28 0 0 too many errors
c6t6d0 ONLINE 0 0 0
We got SMARTd e-mails like:
Device: /dev/rdsk/c2t6d0, 332 Offline uncorrectable sectors
So we decided to replace it with a new disk, SN 9QJ5R6W5; we ran:
root@t3fs07 $ cfgadm -a | grep c2t6d0
c2::dsk/c2t6d0 disk connected configured unknown
root@t3fs07 $ cfgadm -c unconfigure c2::dsk/c2t6d0
root@t3fs07 $ cfgadm -a | grep c2t6d0
c2::rdsk/c2t6d0 disk connected unconfigured unknown
and suddenly we got this ILOM e-mail:
ID = 1d3 : 03/13/2011 : 18:33:42 : Drive Slot : DBP/HDD14/STATE : Hot Spare
After the disk change we ran:
root@t3fs07 $ zpool replace data1 c2t6d0
and zpool status reports:
...
spare DEGRADED 0 0 620K
replacing DEGRADED 0 0 0
c2t6d0s0/o FAULTED 28 0 0 too many errors
c2t6d0 ONLINE 0 0 0 1.46G resilvered
c6t6d0 ONLINE 0 0 0 107K resilvered
c3t3d0 ONLINE 0 0 0
...
We again updated Oracle case SR 3-3171818131.
2011-03-14 t3fs07 Spare disk stuck in an apparently healthy raidz2 vdev; needed to manually remove it
For some strange reason a spare was still attached to the first raidz vdev of the pool. Both the spare and the disk it had replaced looked good and seemed to serve the same purpose in the raidz. We tried to put the spare back into the unused spare list.
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
scrub: resilver completed after 7h37m with 0 errors on Mon Mar 14 01:23:01 2011
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
spare ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
...
Detaching the spare helped:
root@t3fs07 $ zpool detach data1 c6t7d0
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
scrub: resilver completed after 7h37m with 0 errors on Mon Mar 14 01:23:01 2011
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
2011-03-15 t3fs07 ILOM reports disk failure but fmadm is unaware of it - ILOM bug?
The e-mail message from the ILOM and the ILOM logs contain the following:
ID = 1d7 : 03/15/2011 : 15:47:00 : Drive Slot : DBP/HDD8/STATE : Drive Fault
Checking fmadm and the ZFS status shows that the OS and the filesystem are unaware of any fault. The chassis also has no yellow service LED lit.
root@t3fs07 $ fmadm faulty
root@t3fs07 $ zpool status -x
all pools are healthy
The hd tool shows that HDD8 maps to c2t0. There do not seem to be extraordinarily many SMART failures on that disk:
root@t3fs07 $ hd -R
0 c1t0 77039094 0 22 0 8731708701 9762 0 22 0 0 0 0 555286555 27 0 20 77039094 0 0 0
1 c1t1 194958566 0 22 44 141875462526 9763 0 22 0 0 0 0 589037598 30 0 20 194958566 0 0 0
2 c1t2 91496381 0 22 0 17325275533 9762 0 22 0 0 0 0 589037598 30 0 19 91496381 0 0 0
3 c1t3 182739422 0 91 0 78586645 12492 0 89 0 0 1 0 638976033 33 0 17 182739422 0 0 0
4 c1t4 159143012 0 22 0 348032004841 9762 0 22 0 0 0 0 521666585 25 0 19 159143012 0 0 0
5 c1t5 73493979 0 23 14 8734333959 9762 0 24 0 8 0 0 589037597 29 0 20 73493979 0 0 0
6 c1t6 27164439 0 22 0 25912081303 9763 0 22 0 0 0 0 605880350 30 0 20 27164439 0 0 0
7 c1t7 63012202 0 22 0 8734480074 9762 0 22 0 0 0 0 622723103 31 0 20 63012202 0 0 0
8 c2t0 234259907 0 38 0 46188498 9044 0 38 0 0 0 0 538247194 26 0 11 234259907 0 0 0
9 c2t1 16728995 0 21 1 13023489736 9782 0 21 0 0 0 0 589037597 29 0 21 16728995 0 0 0
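A direct SMART health query can double-check the ILOM claim; smartctl -H prints only the overall self-assessment (a sketch using the same smartctl as elsewhere on this page):
root@t3fs07 $ smartctl -H /dev/rdsk/c2t0d0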
With high probability there is some malfunction in the ILOM. I am resetting it now.
2011-04-21 t3fs11 Nagios reports disk failure but we didn't get the ILOM e-mail or the SMARTd e-mail
Nagios
Nagios was the first to notice the problem (thanks to its active checks):
Notification Type: PROBLEM
Service: check_zfs_data1
Host: t3fs11
Address: 192.33.123.51
State: WARNING
Date/Time: 04-21-2011 02:42:08
Additional Info:
WARNING ZPOOL data1 : DEGRADED {Size:40.6T Used:26.5T Avail:14.1T Cap:65%} raidz2:DEGRADED (c4t4d0:FAULTED)
fmadm
This is what fmadm reports:
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 21 02:30:30 2e66bdd5-5534-c779-a89e-ee6fba716380 ZFS-8000-FD Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.fs.zfs.vdev.io
Affects : zfs://pool=data1/vdev=c8e86b7110798cbe
faulted and taken out of service
Problem in : zfs://pool=data1/vdev=c8e86b7110798cbe
faulted and taken out of service
Description : The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD
for more information.
Response : The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
SMARTd
SMARTd was aware of the problem, but no e-mail was sent; after killing and restarting SMARTd I got 2 e-mails:
Device: /dev/rdsk/c4t4d0, 353 Currently unreadable (pending) sectors
Device: /dev/rdsk/c4t4d0, 353 Offline uncorrectable sectors
About these missed SMARTd e-mails: I guess it was my fault, because I changed a disk in t3fs11 some days ago and didn't kill/restart SMARTd with the command:
nohup /opt/csw/sbin/smartd -q never -d &
Next time I change a disk I will restart SMARTd as well.
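A minimal kill/restart sequence (a sketch, assuming smartd runs as the manually started daemon above and not under SMF):
root@t3fs11 $ pkill smartd
root@t3fs11 $ nohup /opt/csw/sbin/smartd -q never -d &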
ILOM
We got the ILOM e-mail only several hours later:
ID = 412 : 04/21/2011 : 11:57:49 : Drive Slot : DBP/HDD24/STATE : Drive Fault
2011-08-23 t3fs11 disk failure, SCSI timeouts, no automatic ZFS failover
SP log:
1994 Tue Aug 23 22:19:05 2011 IPMI Log critical
ID = 419 : 08/23/2011 : 22:19:05 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1993 Tue Aug 23 22:18:22 2011 IPMI Log critical
ID = 418 : 08/23/2011 : 22:18:22 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1985 Fri Aug 19 16:31:25 2011 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
1984 Fri Aug 19 15:50:09 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
1983 Mon Aug 15 20:11:29 2011 Email Connection major
Alert rule 1: Failed to open smtp connection
1982 Mon Aug 15 20:08:19 2011 IPMI Log critical
ID = 415 : 08/15/2011 : 20:08:19 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1981 Thu Apr 21 17:09:51 2011 IPMI Log critical
ID = 414 : 04/21/2011 : 17:09:51 : Drive Slot : DBP/HDD28/STATE : Hot Sp
are
1980 Thu Apr 21 16:48:44 2011 IPMI Log critical
ID = 413 : 04/21/2011 : 16:48:44 : Drive Slot : DBP/HDD28/STATE : Hot Sp
are
1979 Thu Apr 21 11:57:49 2011 IPMI Log critical
ID = 412 : 04/21/2011 : 11:57:49 : Drive Slot : DBP/HDD24/STATE : Drive
Fault
1978 Wed Apr 20 12:44:24 2011 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
# fmadm takes a long time (minutes):
root@t3fs11 $ fmadm faulty
Aug 24 10:04:43 t3fs11 scsi: WARNING: /pci@0,0/pci10de,375@b/pci1000,1000@0 (mpt1):
Aug 24 10:04:43 t3fs11 Disconnected command timeout for Target 1
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 23 22:18:52 febe5ca3-9978-c5bc-c551-e0d74e165743 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019d0c890//pci@0,0/pci10de,375@b/pci1000,1000@0/sd@1,0
faulted but still in service
FRU : "HD_ID_9" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5RCC2:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=9/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u <EVENT-ID> to identify the disk.
As in earlier cases the system had to be forcefully shut down. I also reset the SP, since we had had problems receiving e-mail messages from it, and I wanted to start from a clean state. As before, the restart of the system involved a longer waiting time in the initialization phase.
-> stop -force /SYS
Are you sure you want to immediately stop /SYS (y/n)? y
Stopping /SYS immediately
-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP
I started the system again, and it came up cleanly. ZFS currently seems to be ignorant of the disk problem; it is only detected at the ILOM/fmadm level:
root@t3fs11 $ zpool status -x
all pools are healthy
Reading out the SMART values:
root@t3fs11 $ hd -R
0 c1t0 17249968 0 33 5 373763131589 13633 0 33 0 0 0 0 387252247 23 0 21 17249968 0 0 0
1 c1t1 170699931 0 33 36 12983724431 13632 0 33 0 0 0 0 437780506 26 0 23 170699931 0 0 0
2 c1t2 243231875 0 33 1 107479759363 13633 0 33 0 0 0 0 471466012 28 0 22 243231875 0 0 0
3 c1t3 131051304 0 32 2 100680997 13652 0 32 0 0 0 0 521994271 31 0 23 131051304 0 0 0
4 c1t4 195635912 0 33 3 102820889 13633 0 33 0 0 0 0 387252247 23 0 21 195635912 0 0 0
5 c1t5 90580622 0 33 0 932108957468 13633 0 33 0 1 0 0 420937753 25 0 21 90580622 0 0 0
6 c1t6 186048626 0 33 0 34457395216 13633 0 33 0 0 0 0 437780506 26 0 21 186048626 0 0 0
7 c1t7 181648238 0 32 0 8698160007 13651 0 32 0 0 0 0 471531548 28 0 22 181648238 0 0 0
8 c2t0 137122786 0 34 3 206262087601 13632 0 34 0 0 0 0 387186711 23 0 20 137122786 0 0 0
9 c2t1 76656715 0 34 2045 17284731974 13632 0 34 0 172 4295032833 0 404160536 24 0 22 76656715 2 2 0
10 c2t2 225594443 0 34 21 102963032 13632 0 34 0 0 0 0 437780506 26 0 21 225594443 0 0 0
11 c2t3 144645751 0 33 0 4396831575 13652 0 33 0 0 0 0 454688795 27 0 21 144645751 0 0 0
12 c2t4 161275767 0 34 1 1374496400531 13632 0 34 0 0 0 0 387186711 23 0 20 161275767 0 0 0
13 c2t5 165470843 0 34 1 8692164285 13632 0 34 0 0 0 0 404095000 24 0 21 165470843 0 0 0
14 c2t6 5576874 0 32 0 47547043 6202 0 32 0 0 0 0 421003289 25 0 17 5576874 0 0 0
15 c2t7 179049698 0 34 1 8695524291 13632 0 34 0 0 0 0 471531548 28 0 22 179049698 0 0 0
16 c3t0 181158015 0 33 6 12985284255 13632 0 33 0 0 0 0 370409494 22 0 20 181158015 0 0 0
17 c3t1 131064129 0 33 3 90302906083 13632 0 33 0 0 0 0 420937753 25 0 23 131064129 0 0 0
18 c3t2 18212729 0 34 8 17283364757 13632 0 34 0 0 0 0 454623259 27 0 23 18212729 0 0 0
19 c3t3 238229364 0 21 0 3256974 66 0 21 0 0 0 0 454688795 27 0 20 238229364 0 0 0
20 c3t4 179090552 0 34 11 38757158962 13632 0 34 0 0 0 0 387252247 23 0 21 179090552 0 0 0
21 c3t5 139475973 0 34 0 43049644397 13632 0 34 0 0 0 0 387317783 23 0 21 139475973 0 0 0
22 c3t6 21702988 0 34 9 1052372408231 13632 0 34 0 0 0 0 421003289 25 0 21 21702988 0 0 0
23 c3t7 200787145 0 33 2 17285706890 13652 0 33 0 2 0 0 437846042 26 0 21 200787145 0 0 0
24 c4t0 219100722 0 26 0 4327199423 3091 0 26 0 0 0 0 370409494 22 0 16 219100722 0 0 0
25 c4t1 171714245 0 34 1 124657330319 13632 0 34 0 0 0 0 404095000 24 0 21 171714245 0 0 0
26 c4t2 210424931 0 34 13 12984965035 13632 0 34 0 0 0 0 421003289 26 0 22 210424931 0 0 0
27 c4t3 200584910 0 11 0 57403047 6582 0 11 0 0 0 0 471466012 28 0 19 200584910 0 0 0
28 c4t4 26488941 0 25 0 29022271 2998 0 25 0 0 0 0 387252247 23 0 18 26488941 0 0 0
29 c4t5 47445573 0 34 0 184788229491 13632 0 34 0 0 0 0 404160536 24 0 23 47445573 0 0 0
30 c4t6 12512631 0 34 0 4398114222 13632 0 34 0 0 0 0 404095000 23 0 20 12512631 0 0 0
31 c4t7 121268631 0 34 0 1408851383524 13632 0 34 0 0 0 0 454688795 27 0 23 121268631 0 0 0
32 c5t0 80681234 0 34 1 738842372009 13632 0 34 0 0 0 0 387252247 23 0 21 80681234 0 0 0
33 c5t1 193436627 0 8 1 58130853 6713 0 8 0 0 0 0 404160536 24 0 23 193436627 0 0 0
34 c5t2 57580333 0 34 0 104257729 13632 0 34 0 0 0 0 420937753 25 0 21 57580333 0 0 0
35 c5t3 2134880 0 34 80 12990605345 13632 0 34 0 0 0 0 437846042 26 0 22 2134880 0 0 0
36 c5t4 82431556 0 34 0 12984848023 13632 0 34 0 0 0 0 370409494 22 0 20 82431556 0 0 0
37 c5t5 210183630 0 34 0 4402858349 13632 0 34 0 0 0 0 437780506 26 0 24 210183630 0 0 0
38 c5t6 195477990 0 33 0 103055322 13678 0 33 0 0 0 0 420937753 25 0 21 195477990 0 0 0
39 c5t7 43367020 0 34 4 25876628588 13632 0 34 0 0 0 0 437846042 26 0 22 43367020 0 0 0
40 c6t0 78208807 0 40 0 8684273605 11990 0 39 0 0 0 0 387252247 23 0 14 78208807 0 0 0
41 c6t1 229417618 0 34 0 4395271366 13632 0 34 0 0 0 0 404095000 24 0 22 229417618 0 0 0
42 c6t2 66862671 0 34 5 12993000752 13632 0 34 0 0 0 0 421003289 25 0 23 66862671 0 0 0
43 c6t3 107140408 0 33 0 4400705110 13651 0 33 0 0 0 0 454688795 27 0 23 107140408 0 0 0
44 c6t4 53175997 0 34 6 322226459394 13632 0 34 0 0 4295032833 0 370409494 22 0 20 53175997 0 0 0
45 c6t5 125733869 0 33 0 32685491 13684 0 33 0 0 0 0 404160536 24 0 23 125733869 0 0 0
46 c6t6 240801455 0 34 5 30110127601 13632 0 34 0 0 0 0 421003289 25 0 22 240801455 0 0 0
47 c6t7 176375110 0 32 0 4351157937 13677 0 33 0 0 0 0 437846042 26 0 21 176375110 0 0 0
Mapping:
c2t1d0 | c2::dsk/c2t1d0 | /pci@0,0/pci10de,375@b/pci1000,1000@0/sd@1,0
Offlining the disk and initiating the resilver:
root@t3fs11 $ zpool offline data1 c2t1d0
root@t3fs11 $ cfgadm -c unconfigure c2::dsk/c2t1d0
root@t3fs11 $ zpool replace data1 c2t1d0 c6t6d0
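For the record, the per-device error counters can confirm that this disk is the one misbehaving (a sketch; iostat -En is standard Solaris and prints soft/hard/transport error counts, run before the unconfigure while the device node still exists):
root@t3fs11 $ iostat -En c2t1d0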
List of occurrences in 2013:
2013-08-05 t3fs07 OS/disks crash
The server was almost totally frozen:
Aug 5 16:53:22 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:53:22 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:54:33 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:54:33 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:55:44 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:55:44 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:55:45 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd33):
Aug 5 16:55:45 t3fs07 SCSI transport failed: reason 'reset': giving up
Aug 5 16:56:55 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:56:55 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:58:06 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:58:06 t3fs07 Disconnected command timeout for Target 1
REBOOT
LSI Corporation MPT SAS BIOS
MPTBIOS-6.26.00.00 (2008.10.14) <-----
Copyright 2000-2008 LSI Corporation.
Adapter configuration may have changed, reconfiguration is suggested!
Searching for devices at HBA 0...
Searching for devices at HBA 1...
0de,376@f/pci1000,1000@0 (mpt5):
SLOT ID LUN VENDOR PRODUCT REVISION SIZE \ NV
---- --- --- -------- ---------------- ---------- ---------
0 0 0 ATA SEAGATE ST31000N SU0E 931 GB 000,1000@0 (mpt5):
0 1 0 ATA SEAGATE ST31000N SU0E 931 GB
0 LSILogic SAS1068E-IT 1.27.02.00 NV 2D:03 <-----
0 0 0 ATA SEAGATE ST31000N SU12 931 GB
0 1 0 ATA SEAGATE ST31000N SU0E 931 GB
0 LSILogic SAS1068E-IT 1.27.02.00 NV 2D:03 <-----
GRUB ERROR
'/platform/i86pc/multiboot -B zfs-bootfs=rpool/61,bootpath="/pci@0,0/pci-ide@4/ o@0
ide@0/cmdk@0,0:a",diskdevid="id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____0000137F/a"'
is loaded
module /platform/i86pc/boot_archive
zio_read_data failed
Error 16: Inconsistent filesystem structure
Booting 'Solaris 10 10/09 s10x_u8wos_08a X86'
findroot (pool_rpool,0,a)
Filesystem type is zfs, partition type 0xbf 0de,376@f/pci1000,1000@0 (mpt5):
kernel$ /platform/i86pc/multiboot -B $ZFS-BOOTFS
loading '/platform/i86pc/multiboot -B $ZFS-BOOTFS' ...
[Multiboot-elf, <0x1000000:0x1442b:0x12901>, shtab=0x1027258, entry=0x100000 000,1000@0 (mpt5):
0]
'/platform/i86pc/multiboot -B zfs-bootfs=rpool/61,bootpath="/pci@0,0/pci-ide@4/
ide@0/cmdk@0,0:a",diskdevid="id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____0000137F/a"'
is loaded
module /platform/i86pc/boot_archive
checksum verification failed
Error 16: Inconsistent filesystem structure
Press any key to continue...
SOLARIS FAILSAFE ATTEMPT
SunOS Release 5.10 Version Generic_141445-09 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring devices.
WARNING: /pci@0,0/pci-ide@4/ide@0 unable to enable write cache targ=0
Searching for installed OS instances...
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd41):
SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd41):
SCSI transport failed: reason 'reset': giving up
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
AFTER THE NEW SOLARIS 11 INSTALLATION
# zfs snapshot rpool/ROOT/solaris@06-08-2013
# wget http://mirror.opencsw.org/opencsw/pkgutil.pkg
# pkgadd -d pkgutil.pkg
# /opt/csw/bin/pkgutil -i lsof -y
# /opt/csw/bin/pkgutil -i xpdf ggrep vim wgetpaste watch top sudo python pstree nrpe_plugin nano -y
# /opt/csw/bin/pkgutil -i nagios_plugins -y
# /opt/csw/bin/pkgutil -i nrpe -y
# /opt/csw/bin/pkgutil -i netsnmp -y
# /opt/csw/bin/pkgutil -y -i bash gawk emacs
# /opt/csw/bin/pkgutil -y -i smartmontools && svcadm disable cswsmartd
# /opt/csw/bin/pkgutil -y -i gsed rsync
# /opt/csw/bin/pkgutil -i -y CSWpm-libwww-perl
# svcadm disable cswrsyncd
# /opt/csw/bin/pkgutil -i netcat -y
root@t3fs07:/export/home/jack# pkgadd -d ./SUNWhd-1.07.pkg
root@t3fs07:/export/home/jack# hd -c -s
platform = Sun Fire X4540
Device Serial Vendor Model Rev Temperature
------ ------ ------ ----- ---- -----------
c10t0d0p0 9QJ5KV96 ATA SEAGATE ST31000N SU12 22 C (71 F)
c10t1d0p0 9QJ5QHCN ATA SEAGATE ST31000N SU0E 26 C (78 F)
c10t2d0p0 W9K0HZ0U061L ATA Hitachi HUA72201 A3EA 25 C (77 F)
c10t3d0p0 F002PAJUSJ4F ATA HITACHI HUA7210S AC5A 30 C (86 F)
c10t4d0p0 9QJ5RVKJ ATA SEAGATE ST31000N SU0E 23 C (73 F)
c10t5d0p0 9QJ5R7JX ATA SEAGATE ST31000N SU0E 25 C (77 F)
c10t6d0p0 9QJ5R6W5 ATA SEAGATE ST31000N SU12 25 C (77 F)
c10t7d0p0 A060PBK4JJTF ATA HITACHI HUA7210S AC5A 30 C (86 F)
c11t0d0p0 F002PBJTH4KF ATA HITACHI HUA7210S AC5A 23 C (73 F)
c11t1d0p0 9QJ5TMNF ATA SEAGATE ST31000N SU0E 24 C (75 F)
c11t2d0p0 9QJ5V7FP ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t3d0p0 9QJ5QJ1G ATA SEAGATE ST31000N SU0E 27 C (80 F)
c11t4d0p0 9QJ5V7FA ATA SEAGATE ST31000N SU0E 22 C (71 F)
c11t5d0p0 9QJ5QKCV ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t6d0p0 9QJ5LT8H ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t7d0p0 A060PBK528EF ATA HITACHI HUA7210S AC5A 29 C (84 F)
c12t0d0p0 9QJ5TM9Z ATA SEAGATE ST31000N SU0E 23 C (73 F)
c12t1d0p0 9QJ5RVQL ATA SEAGATE ST31000N SU0E 24 C (75 F)
c12t2d0p0 W9K0HD2XKHTL ATA HITACHI H7210CA3 A3CB 25 C (77 F)
c12t3d0p0 9QJ5RW4M ATA SEAGATE ST31000N SU0E 27 C (80 F)
c12t4d0p0 W9H0N01D14MV ATA Hitachi HUA72201 A3EA 22 C (71 F)
c12t5d0p0 9QJ5R9NJ ATA SEAGATE ST31000N SU0E 25 C (77 F)
c12t6d0p0 9QJ7VM1J ATA SEAGATE ST31000N SU0F 25 C (77 F)
c12t7d0p0 9QJ5MWQG ATA SEAGATE ST31000N SU0F 26 C (78 F)
c13t0d0p0 9QJ5V7FS ATA SEAGATE ST31000N SU0E 22 C (71 F)
c13t1d0p0 WMAW31661409 ATA WDC WD1003FBYX-0 1V02 25 C (77 F)
c13t2d0p0 W9K0HZ082KVL ATA Hitachi HUA72201 A3EA 25 C (77 F)
c13t3d0p0 A060PBK4ZS0F ATA HITACHI HUA7210S AC5A 29 C (84 F)
c13t4d0p0 9QJ5QY5N ATA SEAGATE ST31000N SU0E 23 C (73 F)
c13t5d0p0 9QJ5RV8M ATA SEAGATE ST31000N SU0E 25 C (77 F)
c13t6d0p0 9QJ5R6QV ATA SEAGATE ST31000N SU0E 26 C (78 F)
c13t7d0p0 9QJ5NHR3 ATA SEAGATE ST31000N SU0E 27 C (80 F)
c14t0d0p0 9QJ5TMJE ATA SEAGATE ST31000N SU0E 23 C (73 F)
c14t1d0p0 9QJ5RRV8 ATA SEAGATE ST31000N SU0E 25 C (77 F)
c14t2d0p0 9QJ5P70T ATA SEAGATE ST31000N SU0E 25 C (77 F)
c14t3d0p0 9QJ4YZAX ATA SEAGATE ST31000N SU0F 28 C (82 F)
c14t4d0p0 A060PBK56GJF ATA HITACHI HUA7210S AC5A 24 C (75 F)
c14t5d0p0 9QJ5S4JF ATA SEAGATE ST31000N SU0E 26 C (78 F)
c14t6d0p0 9QJ5TM8X ATA SEAGATE ST31000N SU0E 27 C (80 F)
c14t7d0p0 9QJ5QQAF ATA SEAGATE ST31000N SU0E 28 C (82 F)
c8d0p0 00014E4 - UGB30SDC16H0P4 - None
c9t0d0p0 9QJ5TMAY ATA SEAGATE ST31000N SU0E 23 C (73 F)
c9t1d0p0 9QJ5RR2P ATA SEAGATE ST31000N SU0E 26 C (78 F)
c9t2d0p0 W9K0HD2XV2DL ATA HITACHI H7210CA3 A3CB 26 C (78 F)
c9t3d0p0 9QJ3C4MZ ATA SEAGATE ST31000N SU0F 28 C (82 F)
c9t4d0p0 9QJ5QMCF ATA SEAGATE ST31000N SU0E 22 C (71 F)
c9t5d0p0 9QJ5RV9K ATA SEAGATE ST31000N SU0E 25 C (77 F)
c9t6d0p0 W9K0N015K0GL ATA Hitachi HUA72201 A3EA 25 C (77 F)
c9t7d0p0 9QJ5TN9J ATA SEAGATE ST31000N SU0E 27 C (80 F)
-----------------------------SunFire X4540-------Rear-----------------
3: 7: 11: 15: 19: 23: 27: 31: 35: 39: 43: 47:
c9t3 c9t7 c10t3 c10t7 c11t3 c11t7 c12t3 c12t7 c13t3 c13t7 c14t3 c14t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
2: 6: 10: 14: 18: 22: 26: 30: 34: 38: 42: 46:
c9t2 c9t6 c10t2 c10t6 c11t2 c11t6 c12t2 c12t6 c13t2 c13t6 c14t2 c14t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
1: 5: 9: 13: 17: 21: 25: 29: 33: 37: 41: 45:
c9t1 c9t5 c10t1 c10t5 c11t1 c11t5 c12t1 c12t5 c13t1 c13t5 c14t1 c14t5
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 4: 8: 12: 16: 20: 24: 28: 32: 36: 40: 44:
c9t0 c9t4 c10t0 c10t4 c11t0 c11t4 c12t0 c12t4 c13t0 c13t4 c14t0 c14t4
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*---------*-----------SunFire X4540---*---Front-----*-------*---
Summary:
Vendor Model Count
------ ----- -----
ATA SEAGATE ST31000N 35
ATA Hitachi HUA72201 4
ATA HITACHI HUA7210S 6
ATA HITACHI H7210CA3 2
ATA WDC WD1003FBYX-0 1
Total Storage Devices = 48
WHICH ARE THE BROKEN DISKS?
HD_26 c12t2
HD_41 c14t1 <-- this one is really driving Solaris crazy; it reacts with tens of:
Aug 6 22:44:18 t3fs07 scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 6 22:44:18 t3fs07 Disconnected command timeout for Target 1
HD_47 c14t7
IMPORTING ZFS /data1 INTO SOLARIS 11
Aug 7 10:50:19 t3fs07 zfs: [ID 249136 kern.info] imported version 15 pool data1 using 34
Aug 7 10:53:13 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 7 10:53:13 t3fs07 EVENT-TIME: Wed Aug 7 10:53:12 CEST 2013
Aug 7 10:53:13 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 7 10:53:13 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 7 10:53:13 t3fs07 EVENT-ID: 13ded51b-674e-c92b-f852-9a456cc01793
Aug 7 10:53:13 t3fs07 DESC: ZFS device 'id1,sd@n5000c50019c4f1c2/a' in pool 'data1' failed to open.
Aug 7 10:53:13 t3fs07 AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
Aug 7 10:53:13 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 7 10:53:13 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
Aug 7 10:53:13 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 7 10:53:13 t3fs07 EVENT-TIME: Wed Aug 7 10:53:13 CEST 2013
Aug 7 10:53:13 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 7 10:53:13 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 7 10:53:13 t3fs07 EVENT-ID: a70d0f7d-f4ff-e396-d9da-cb8d5caa4841
Aug 7 10:53:13 t3fs07 DESC: ZFS device 'id1,sd@n5000c50019b40654/a' in pool 'data1' failed to open.
Aug 7 10:53:13 t3fs07 AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
Aug 7 10:53:13 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 7 10:53:13 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
AFTER ZFS COMPLETED THE RESILVERING
-bash-4.1# zpool status
pool: data1
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
Run 'zpool status -v' to see device specific details.
see: http://support.oracle.com/msg/ZFS-8000-8A
scan: resilvered 1.44T in 11h53m with 1 errors on Wed Aug 7 22:47:36 2013
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 1
raidz2-0 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
c9t5d0 ONLINE 0 0 0
c10t2d0 ONLINE 0 0 0
c10t7d0 ONLINE 0 0 0
c11t4d0 ONLINE 0 0 0
c12t1d0 ONLINE 0 0 0
c12t6d0 ONLINE 0 0 0
c13t3d0 ONLINE 0 0 0
c14t0d0 ONLINE 0 0 0
raidz2-1 DEGRADED 0 0 121
c9t1d0 DEGRADED 0 0 121
c9t6d0 DEGRADED 0 0 121
c10t3d0 DEGRADED 0 0 121
c11t0d0 DEGRADED 0 0 121
c11t5d0 DEGRADED 0 0 121
spare-5 DEGRADED 0 0 0
15623725476041760867 UNAVAIL 0 0 0
c14t6d0 DEGRADED 0 0 0
c12t7d0 DEGRADED 1 0 0
c13t4d0 DEGRADED 0 0 121
spare-8 DEGRADED 0 0 0
1583280912036438145 UNAVAIL 0 0 0
c14t7d0 DEGRADED 0 0 0
raidz2-2 ONLINE 0 0 0
c9t2d0 ONLINE 0 0 0
c9t7d0 ONLINE 0 0 0
c10t4d0 ONLINE 0 0 0
c11t1d0 ONLINE 0 0 0
c11t6d0 ONLINE 0 0 0
c12t3d0 ONLINE 0 0 0
c13t0d0 ONLINE 0 0 0
c13t5d0 ONLINE 0 0 0
c14t2d0 ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
c9t3d0 ONLINE 0 0 0
c10t0d0 ONLINE 0 0 0
c10t5d0 ONLINE 0 0 0
c11t2d0 ONLINE 0 0 0
c11t7d0 ONLINE 0 0 0
c12t4d0 ONLINE 0 0 0
c13t1d0 ONLINE 0 0 0
c13t6d0 ONLINE 0 0 0
c14t3d0 ONLINE 0 0 0
raidz2-4 ONLINE 0 0 0
c9t4d0 ONLINE 0 0 0
c10t1d0 ONLINE 0 0 0
c10t6d0 ONLINE 0 0 0
c11t3d0 ONLINE 0 0 0
c12t0d0 ONLINE 0 0 0
c12t5d0 ONLINE 0 0 0
c13t2d0 ONLINE 0 0 0
c13t7d0 ONLINE 0 0 0
c14t4d0 ONLINE 0 0 0
spares
c14t7d0 INUSE
c14t6d0 INUSE
c14t5d0 AVAIL
errors: 1 data errors, use '-v' for a list
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c8d0 ONLINE 0 0 0
errors: No known data errors
DETACHING THE BROKEN DISKS
-bash-4.1# zpool detach data1 15623725476041760867
-bash-4.1# zpool detach data1 1583280912036438145
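After detaching the two stale vdev GUIDs, the activated spares become permanent pool members; a quick check that the pool now reports them in place (sketch):
-bash-4.1# zpool status data1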
ALLOWING root LOGIN by SSH
http://veereshkumarn.blogspot.ch/2012/09/how-to-enable-ssh-root-login-in-solaris.html
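In short, that page boils down to the following steps (a sketch for a stock Solaris 11 install; verify against the link):
-bash-4.1# rolemod -K type=normal root     # root is only a role by default on Solaris 11
-bash-4.1# vi /etc/ssh/sshd_config         # set PermitRootLogin yes
-bash-4.1# svcadm restart svc:/network/ssh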
ALSO THE NEW FLASH CARD IS BROKEN!
Aug 9 10:35:44 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
Aug 9 10:35:44 t3fs07 EVENT-TIME: Fri Aug 9 10:35:44 CEST 2013
Aug 9 10:35:44 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 9 10:35:44 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 9 10:35:44 t3fs07 EVENT-ID: 94b57bca-b249-cac8-90dd-e65056c254ae
Aug 9 10:35:44 t3fs07 DESC: A file or directory in pool 'rpool' could not be read due to corrupt data.
Aug 9 10:35:44 t3fs07 AUTO-RESPONSE: No automated response will occur.
Aug 9 10:35:44 t3fs07 IMPACT: The file or directory is unavailable.
Aug 9 10:35:44 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.
Aug 9 10:35:47 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 9 10:35:47 t3fs07 EVENT-TIME: Fri Aug 9 10:35:47 CEST 2013
Aug 9 10:35:47 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 9 10:35:47 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 9 10:35:47 t3fs07 EVENT-ID: fe9c1a29-d87c-6bf2-cd5a-f2ca0042d38b
Aug 9 10:35:47 t3fs07 DESC: The number of checksum errors associated with ZFS device 'id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b' in pool 'rpool' exceeded acceptable levels.
Aug 9 10:35:47 t3fs07 AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
Aug 9 10:35:47 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 9 10:35:47 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-GH for the latest service procedures and policies regarding this diagnosis.
-bash-4.1# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 09 10:35:44 94b57bca-b249-cac8-90dd-e65056c254ae ZFS-8000-8A Critical
Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
System Component
Manufacturer : Sun-Microsystems
Name : Sun-Fire-X4540
Part_Number : 602-4887-01
Serial_Number : 0949AMR020
Host_ID : 00c18d96
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.object.corrupt_data
Certainty : 100%
Affects : zfs://pool=916b26b45c63015a/pool_name=rpool
Status : faulted but still providing degraded service
FRU
Name : "zfs://pool=916b26b45c63015a/pool_name=rpool"
Status : faulty
Description : A file or directory in pool 'rpool' could not be read due to
corrupt data.
Response : No automated response will occur.
Impact : The file or directory is unavailable.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -xv' and examine the list of damaged files to
determine what has been affected. Please refer to the associated
reference document at http://support.oracle.com/msg/ZFS-8000-8A
for the latest service procedures and policies regarding this
diagnosis.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 09 10:35:47 fe9c1a29-d87c-6bf2-cd5a-f2ca0042d38b ZFS-8000-GH Major
Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
System Component
Manufacturer : Sun-Microsystems
Name : Sun-Fire-X4540
Part_Number : 602-4887-01
Serial_Number : 0949AMR020
Host_ID : 00c18d96
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.vdev.checksum
Certainty : 100%
Affects : zfs://pool=916b26b45c63015a/vdev=42e34d6f3fa7092a/pool_name=rpool/vdev_name=id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b
Status : faulted but still providing degraded service
FRU
Name : "zfs://pool=916b26b45c63015a/vdev=42e34d6f3fa7092a/pool_name=rpool/vdev_name=id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b"
Status : faulty
Description : The number of checksum errors associated with ZFS device
'id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b' in pool 'rpool'
exceeded acceptable levels.
Response : The device has been marked as degraded. An attempt will be made
to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the
associated reference document at
http://support.oracle.com/msg/ZFS-8000-GH for the latest service
procedures and policies regarding this diagnosis.
REBOOT => OS corrupted
svc.configd: smf(5) database integrity check of:
/etc/svc/repository.db
failed. The database might be damaged or a media error might have
prevented it from being verified. Additional information useful to
your service provider is in:
/system/volatile/db_errors
The system will not be able to boot until you have restored a working
database. svc.startd(1M) will provide a sulogin(1M) prompt for recovery
purposes. The command:
/lib/svc/bin/restore_repository
can be run to restore a backup version of your repository. See
http://support.oracle.com/msg/SMF-8000-MY for more information.
Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
svc.configd exited with status 102 (database initialization failure)
Enter user name for system maintenance (control-d to bypass): root
Enter root password (control-d to bypass):
single-user privilege assigned to root on /dev/console.
Entering System Maintenance Mode
Aug 9 11:46:35 su: pam_unix_cred: error creating /var/user/root: No such file or directory
Aug 9 11:46:35 su: pam_unix_cred: chown error on /var/user/root: No such file or directory
Aug 9 11:46:35 su: 'su root' succeeded for root on /dev/console
Oracle Corporation SunOS 5.11 11.1 September 2012
-bash-4.1#
-bash-4.1# /lib/svc/bin/restore_repository
See http://support.oracle.com/msg/SMF-8000-MY for more information on the use of
this script to restore backup copies of the smf(5) repository.
If there are any problems which need human intervention, this script will
give instructions and then exit back to your shell.
/lib/svc/bin/restore_repository[71]: [: /: arithmetic syntax error
The following backups of /etc/svc/repository.db exist, from
oldest to newest:
manifest_import-20130806_161857
manifest_import-20130806_162908
boot-20130806_185210
boot-20130807_102619
manifest_import-20130807_144541
manifest_import-20130807_150913
boot-20130808_102130
boot-20130809_102928
The backups are named based on their type and the time what they were taken.
Backups beginning with "boot" are made before the first change is made to
the repository after system boot. Backups beginning with "manifest_import"
are made after svc:/system/manifest-import:default finishes its processing.
The time of backup is given in YYYYMMDD_HHMMSS format.
Please enter either a specific backup repository from the above list to
restore it, or one of the following choices:
CHOICE ACTION
---------------- ----------------------------------------------
boot restore the most recent post-boot backup
manifest_import restore the most recent manifest_import backup
-seed- restore the initial starting repository (All
customizations will be lost, including those
made by the install/upgrade process.)
-quit- cancel script and quit
Enter response [boot]:
Unable to open database "/etc/svc/repository-boot": disk I/O error
After confirmation, the following steps will be taken:
svc.startd(1M) and svc.configd(1M) will be quiesced, if running.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20130809_121110
//system/volatile/db_errors
-- copied --> /etc/svc/repository.db_old_20130809_121110_errors
/etc/svc/repository-boot
-- copied --> /etc/svc/repository.db
and the system will be rebooted with reboot(1M).
Proceed [yes/no]? yes
Quiescing svc.startd(1M) and svc.configd(1M): done.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20130809_121110
//system/volatile/db_errors
-- copied --> /etc/svc/repository.db_old_20130809_121110_errors
/etc/svc/repository-boot
-- copied --> /etc/svc/repository.db
/etc/svc/repository.db.new.22: I/O error
Failed. To start svc.start(1M) running, do: /usr/bin/prun 11
-bash-4.1#
2013-08-07 t3fs11 strange spares behaviour
root@t3fs11 $ zpool status -v
pool: data1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver in progress for 2h32m, 38.44% done, 4h3m to go
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c6t5d0 ONLINE 0 0 0 310G resilvered <-- they were spares
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0 310G resilvered <-- they were spares
c6t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t0d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t6d0 AVAIL
c3t0d0 AVAIL <-- these were pool disks; I swapped both because they were broken
c5t4d0 AVAIL <-- these were pool disks
errors: Permanent errors have been detected in the following files:
/data1/t3fs11_cms_1/data/000002CC181F180E40F593A9313C8EAC5269 <--- I removed all of these files
/data1/t3fs11_cms_1/data/00004F74157CEDC241418C7ECD9A495EDC10
/data1/t3fs11_cms/data/000048FE1CBF54F14D4FA68AB53A5F54F21E
/data1/t3fs11_cms_1/data/0000F54388A555E542EFA8015AB2C86C9F40
/data1/t3fs11_cms_1/data/00002231135E4131437B81CEE1BEFC39978E
/data1/t3fs11_cms/data/0000E748611DBADE4505942C25077055E179
pool: rpool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c0d0s0 ONLINE 0 0 0
errors: No known data errors
root@t3fs11 $
The zfs-diagnosis FMD module got disabled; that had never happened before!
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 05 18:02:26 91b71d11-0d46-62ed-9e8b-b6df1c5ff285 FMD-8000-2K Minor
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : defect.sunos.fmd.module
Affects : fmd:///module/zfs-diagnosis
faulted but still in service
Description : A Solaris Fault Manager component has experienced an error that
required the module to be disabled. Refer to
http://sun.com/msg/FMD-8000-2K for more information.
Response : The module has been disabled. Events destined for the module
will be saved for manual diagnosis.
Impact : Automated diagnosis and response for subsequent events associated
with this module will not occur.
Action : Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
reset <module> to reset the module.
root@t3fs11 $ fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
root@t3fs11 $ fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-retire 1.1 active CPU/Memory Retire Agent
disk-monitor 1.0 active Disk Monitor
disk-transport 1.0 active Disk Transport Agent
eft 1.16 active eft diagnosis engine
fabric-xlate 1.0 active Fabric Ereport Translater
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sp-monitor 1.0 active Service Processor Monitor
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
zfs-retire 1.0 active ZFS Retire Agent