MAY 2013
To get statistics on broken Solaris disks, run:
[root@t3nagios ~]# /usr/local/bin/disks_failure_statistics.sh
Disk problems on the Thumper/Thor Fileservers
Important external information
Best practice for disk replacement
2013-05-31 Replacing a disk without resilvering a spare first
Starting situation: The ILOM has issued a predictive failure warning for a disk based on SMART value detections.
- Check the Solaris Fault Manager. It tells us the number of the disk (28 in this example):
root@t3fs09 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
May 31 09:28:37 703aae75-6f49-638e-8ab5-eb08e580d005 DISK-8000-0X Major
Host : t3fs09
Platform : Sun Fire X4540 Chassis_id : 0949AMR064
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019cb5f77//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@4,0
faulted but still in service
FRU : "HD_ID_28" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR064:server-id=t3fs09:serial=9QJ5SE83:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=28/disk=0)
faulty
- Have a look at the SMART values and find the disk name (here: c4t4) using the hd tool:
root@t3fs09 $ hd -R
...
25 c4t1 0 150 55880057542 27 0 0 33 25510 0 27 909 909 28 13 39 0 0 0 0
26 c4t2 79081685 0 29 23 8238056347322 29147 0 29 0 0 4295032833 0 639172636 28 0 20 79081685 0 0 0
27 c4t3 0 90 25786319174 10 0 0 31 9392 0 10 27 27 28 16 35 0 0 0 0
28 c4t4 76058431 0 29 1383 43229726855 29148 0 29 0 1754 131118 0 571801623 23 0 19 76058431 0 0 0
...
- Find the device mapping that we need to use with the cfgadm command:
root@t3fs09 $ cfgadm -a | grep c4t4
c4::dsk/c4t4d0 disk connected configured unknown
- Offline the disk and unconfigure it, so that the blue disk LED helps you to locate it:
root@t3fs09 $ zpool offline data1 c4t4d0
root@t3fs09 $ cfgadm -c unconfigure c4::dsk/c4t4d0
- Replace the disk in the Thor/Thumper
- Make the disk active for the system by configuring it. Note: upon seeing the new disk, the system regrettably always activates a spare needlessly.
root@t3fs09 $ cfgadm -c configure c4::dsk/c4t4d0
root@t3fs09 $ zpool status -x
...
spare DEGRADED 0 0 108K
replacing DEGRADED 0 0 0
c4t4d0s0/o FAULTED 0 0 0 corrupted data
c4t4d0 ONLINE 0 0 0 1.88G resilvered
c6t5d0 ONLINE 0 0 0 3.74G resilvered
c5t1d0 ONLINE 0 0 0
...
spares
c6t5d0 INUSE currently in use
c6t6d0 AVAIL
c6t7d0 AVAIL
- Stop the needlessly activated spare from resilvering and detach it
root@t3fs09 $ zpool scrub -s data1
root@t3fs09 $ zpool detach data1 c6t5d0
root@t3fs09 $ zpool status -x
...
c3t7d0 ONLINE 0 0 0
replacing DEGRADED 0 0 0
c4t4d0s0/o FAULTED 0 0 0 corrupted data
c4t4d0 ONLINE 0 0 0 643M resilvered
c5t1d0 ONLINE 0 0 0
...
- Issue the replacement command to have the new disk resilvered
root@t3fs09 $ zpool replace data1 c4t4d0
- Enter the string used for the replacement (Here: zpool replace data1 c4t4d0) as a comment for the check_zfs_data1 test of the T3 Nagios. For this example the URL would be: https://t3nagios.psi.ch/nagios/cgi-bin/cmd.cgi?cmd_typ=34&host=t3fs09&service=check_zfs_data1
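The bullet steps above can be strung together. Here is a minimal, untested sketch of the whole procedure, assuming pool data1, faulted disk c4t4d0 on controller c4, and activated spare c6t5d0 as in this example (substitute the names reported by fmadm, hd and zpool status):
# Sketch only - better run the steps one by one and check zpool status in between.
POOL=data1; DISK=c4t4d0; AP=c4::dsk/c4t4d0
zpool offline $POOL $DISK          # take the suspect disk out of service
cfgadm -c unconfigure $AP          # blue LED now marks the bay
# ... physically swap the disk ...
cfgadm -c configure $AP            # system sees the new disk, needlessly activates a spare
zpool scrub -s $POOL               # stop the needless spare resilver
zpool detach $POOL c6t5d0          # detach the activated spare (name from zpool status)
zpool replace $POOL $DISK          # resilver onto the new disk
zpool status -x                    # monitor the resilver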
Typical example (old... from 2009)
Logwatch (q.v. CentralLogHost) shows the following for t3fs02:
Logfiles for Host: t3fs02
##################################################################
--------------------- Kernel module scsi Begin ------------------------
You may have R/W errors on your device 2 Time(s)
Requested Block: 203445440 Error Block: 203445440: 1 time(s)
WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):: 2 time(s)
ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0: 2 time(s)
Sense Key: Aborted_Command: 3 time(s)
Vendor: ATA Serial Number: : 2 time(s)
Requested Block: 203445696 Error Block: 203445696: 1 time(s)
---------------------- Kernel module scsi End -------------------------
Logging in to t3fs02 and running a ZFS status command
zpool status -x
pool: data1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz1 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
c9t4d0 ONLINE 0 0 0
...
c10t5d0 ONLINE 0 0 0
c9t1d0 ONLINE 0 0 1
c9t5d0 ONLINE 0 0 0
...
spares
c4t3d0 AVAIL
c4t7d0 AVAIL
errors: No known data errors
So, there is no damage yet, but one should keep an eye on that server. If the errors get more frequent, the disk should be replaced.
The disk with the checksum error in the status report is c9t1d0. But if I map the PCI name given in the log line, I end up with a different disk name
hd -w /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0
c4t4 = /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0
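When several disks throw warnings, this mapping can be done in bulk. A hedged sketch, assuming the WARNING line format shown above and the hd tool being available:
# Extract the /pci... device paths from SCSI warnings and map each to a disk name.
grep 'scsi:.*WARNING: /pci' /var/adm/messages \
  | sed -n 's!.*WARNING: \(/pci[^ ]*\).*!\1!p' \
  | sort -u \
  | while read p; do hd -w "$p"; done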
Examples of failures
A particularly bad failure from T2_CH_CSCS on X4500/Solaris10 (happened twice within a few months, even though the backplane was exchanged):
Oct 20 09:55:15 se25.lcg.cscs.ch Command failed to complete...Device is gone
Oct 20 09:55:15 se25.lcg.cscs.ch scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@4,0 (sd20):
Oct 20 09:55:15 se25.lcg.cscs.ch drive offline
List of occurrences in 2009:
t3fs02:
Sep 22 18:12:40 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):
Sep 22 18:12:40 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci1022,7458@1/pci11ab,11ab@1/disk@4,0 (sd5):
Oct 1 21:33:18 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@6,0 (sd49):
Oct 1 21:33:18 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@2,0/pci1022,7458@8/pci11ab,11ab@1/disk@6,0 (sd49):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Oct 20 06:54:39 t3fs02 scsi: [ID 107833 kern.warning] WARNING: /pci@1,0/pci1022,7458@3/pci11ab,11ab@1/disk@2,0 (sd21):
Note: t3fs02 currently actually runs OpenSolaris snv_86_rc3 X86, while the other servers run Solaris 10. It may well be that we only see warnings on this machine due to a difference in log reporting, especially since different disks and controllers are involved in the errors.
I cleared the error status for t3fs02 on 2010-01-07 since no more irrecoverable errors had appeared. Recoverable errors like the above seem to happen from time to time.
zpool clear data1
List of occurrences in 2010:
t3fs07 2010-07-15 - example of a successful disk exchange in a running X4540 system
Disk: c3t7d0
I needed to introduce a spare disk manually. Since this was immediately before my holidays, I did it quite fast and then regrettably forgot about the incident after my vacation.
---------------------SunFireX4540-------Rear----------------------------
3: 7: 11: 15: 19: 23: 27: 31: 35: 39: 43: 47:
c1t3 c1t7 c2t3 c2t7 c3t3 c3t7 c4t3 c4t7 c5t3 c5t7 c6t3 c6t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
2: 6: 10: 14: 18: 22: 26: 30: 34: 38: 42: 46:
c1t2 c1t6 c2t2 c2t6 c3t2 c3t6 c4t2 c4t6 c5t2 c5t6 c6t2 c6t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
1: 5: 9: 13: 17: 21: 25: 29: 33: 37: 41: 45:
c1t1 c1t5 c2t1 c2t5 c3t1 c3t5 c4t1 c4t5 c5t1 c5t5 c6t1 c6t5
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 4: 8: 12: 16: 20: 24: 28: 32: 36: 40: 44:
c1t0 c1t4 c2t0 c2t4 c3t0 c3t4 c4t0 c4t4 c5t0 c5t4 c6t0 c6t4
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*-----------*-SunFireX4540---*---Front----*---------*--------
cfgadm -al
Ap_Id Type Receptacle Occupant Condition
...
c3::dsk/c3t7d0 disk connected configured unknown
...
cfgadm -c unconfigure c3::dsk/c3t7d0
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 40.0M
c3t7d0 REMOVED 0 0 0
c6t7d0 ONLINE 0 0 0 725G resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
I replaced disk 23 (marked by a blue LED) in the running system. This can be done if the cover is removed for less than 60 seconds.
Bringing the disk online:
cfgadm -c configure c3::dsk/c3t7d0
zpool replace data1 c3t7d0
# resilvering
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress for 0h0m, 0.07% done, 12h12m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 40.0M
replacing DEGRADED 0 0 0
c3t7d0s0/o FAULTED 0 0 0 corrupted data
c3t7d0 ONLINE 0 0 0 441M resilvered
c6t7d0 ONLINE 0 0 0 20.5K resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
...
After a few hours of resilvering, the spare disk was automatically taken out of the configuration and the array was fixed.
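To keep an eye on a running resilver without sitting on the console, a simple polling loop suffices (sketch; the 5-minute interval is arbitrary):
# Print the resilver progress line every 5 minutes until it completes.
while zpool status data1 | grep 'resilver in progress'; do
    sleep 300
done
zpool status data1 | grep 'scrub:'   # final completion line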
t3fs09 2010-09-12 - example of a successful disk exchange on a X4540 (OS powered down)
Automatic failover has happened
zpool status data1
...
resilver completed after 7h4m with 0 errors on Sun Sep 12 13:13:22 2010
...
raidz2 DEGRADED 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 24.5M
c4t7d0 FAULTED 3 62 0 too many errors
c6t7d0 ONLINE 0 0 0 445G resilvered
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
...
In the ILOM log I find
745 Fri Sep 10 01:21:23 2010 IPMI Log critical
ID = ce : 09/10/2010 : 01:21:23 : Drive Slot : DBP/HDD31/STATE : Drive
Fault
The hd command suddenly blocked during the listing and was almost unkillable.
The internal Solaris 10 fault reporting showed:
root@t3fs09 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Sep 10 03:21:58 69cc60ac-9f06-4e60-f7fa-da22d6374ed2 DISK-8000-0X Major
Host : t3fs09
Platform : Sun Fire X4540 Chassis_id : 0949AMR064
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c5002065fb8d//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@7,0
faulted but still in service
FRU : "HD_ID_31" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR064:server-id=t3fs09:serial=9QJ5Z2FT:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=31/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
I tried to unconfigure the disk
# the disk name to be used in the cfgadm command can be obtained with:
cfgadm -al
cfgadm -c unconfigure c4::dsk/c4t7d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
I then tried to take the disk offline first. The command succeeded, but the zpool status output still looked the same, and the unconfigure failed again.
zpool offline data1 c4t7d0
cfgadm -c unconfigure c4::dsk/c4t7d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
Logging in on the console, I got the following message, which repeats every minute or so:
13:23:53 t3fs09 scsi: WARNING: /pci@3c,0/pci10de,377@a/pci1000,1000@0 (mpt3):
# mapping the PCI address yields a disk on the same controller c4
hd -w /pci@3c,0/pci10de,377@a/pci1000,1000@0
c4t0 = /pci@3c,0/pci10de,377@a/pci1000,1000@0
These items may be relevant:
iostat shows these errors
iostat -En
...
c4t7d0 Soft Errors: 2 Hard Errors: 4 Transport Errors: 9731
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 1 Device Not Ready: 0 No Device: 3 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
...
c4t0d0 Soft Errors: 2 Hard Errors: 2 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 2 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
...
# all other disks typically show this (all show soft errors, but only a few have hard errors)
c5t3d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: ATA Product: SEAGATE ST31000N Revision: SU0E Serial No:
Size: 1000.20GB <1000204885504 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 2 Predictive Failure Analysis: 0
I now exchanged the disk in the powered-down system (the OS was down, but I kept the machine under power and the management processor online).
The defective disk was marked by a blue LED.
Ok. Let's try to bring the disk online.
cfgadm -c configure c4::dsk/c4t7d0
# still listed as FAULTED in the zpool status
zpool clear data1 c4t7d0
# now listed as OFFLINE in the zpool status
zpool online data1 c4t7d0
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Oct 8 17:01:02 CEST 2010
PLATFORM: Sun Fire X4540, CSN: 0949AMR064 , HOSTNAME: t3fs09
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 4edd757e-3dfb-e504-a43c-f81be2b69de3
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
warning: device 'c4t7d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
So, the last two commands did not help much.
Ok. Let's use the replace command with the single-disk argument. This should announce to the system that there is a new disk in the slot:
zpool replace data1 c4t7d0
root@t3fs09 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.17% done, 7h10m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 79
replacing DEGRADED 0 0 0
c4t7d0s0/o FAULTED 0 0 0 corrupted data
c4t7d0 ONLINE 0 0 0 1.08G resilvered
c6t7d0 ONLINE 0 0 0 41.5K resilvered
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
...
OK. This did it. The new disk immediately started to be resilvered. After it had finished several hours later, the spare disk was automatically taken out of the raidz2 array again and put into standby state.
t3fs07 2010-09-30 warnings
Disk: c2t6d0
Entries in the central logs. Shows 16 read errors in the zpool status output.
Observation: The older broken disk and also this one show high counts in the SMART monitoring values for "Command Timeout Count". This seems significant.
No entry in the ILOM logs.
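The suspicious counter can also be read directly with smartctl instead of hd. A hedged sketch; the attribute is usually called Command_Timeout (ID 188) on these Seagate drives, which is an assumption here:
# Dump the SMART attribute table of the suspect disk and pick out the counters of interest.
smartctl -A /dev/rdsk/c2t6d0 | egrep -i 'command_timeout|reallocated'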
t3fs07 2010-10-18 disk failure
System is up, but zpool status just freezes. dCache seems to hang as well.
root@t3fs07 $ zpool status
pool: data1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver completed after 4h30m with 0 errors on Fri Oct 8 22:08:15 2010
... hangs ...
The resilver mentioned above refers to the last problem of this file server.
fmadm faulty hangs for quite some time before yielding:
Host : t3fs07
Platform : Sun Fire X4540 Chassis_id : 0949AMR020
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c3b9c2//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
faulted but still in service
FRU : "HD_ID_3" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
root@t3fs07 $ fmdump -v -u df4e42b9-447f-ea75-8b80-d7165084fd40
TIME UUID SUNW-MSG-ID
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
100% fault.io.disk.predictive-failure
Problem in: hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0
Affects: dev:///:devid=id1,sd@n5000c50019c3b9c2//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
FRU: hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR020:server-id=t3fs07:serial=9QJ5R2HB:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=3/disk=0
Location: HD_ID_3
root@t3fs07 $ fmdump
TIME UUID SUNW-MSG-ID
Oct 08 17:38:06.5585 3f41b1d2-d666-c799-fa35-cb4dfa402077 ZFS-8000-D3
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
The syslog is full of these messages
grep t3fs07 messages | sed -e 's/.*scsi: *//' | sort | uniq -c
695 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
128 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
695 [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
1 Oct 18 13:41:00 t3fs07.psi.ch genunix: [ID 773945 kern.info] UltraDMA mode 2 selected
In the ILOM log I find
769 Sun Oct 17 21:39:34 2010 IPMI Log critical
ID = 17e : 10/17/2010 : 21:39:34 : Drive Slot : DBP/HDD3/STATE : Drive F
ault
I decided to reboot the system. Regrettably the system did not shut down. Console output
Oct 18 13:52:46 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 18 13:52:46 t3fs07 Disconnected command timeout for Target 3
Oct 18 13:53:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 18 13:53:57 t3fs07 Disconnected command timeout for Target 3
I had to force the system down through the ILOM with stop -force /SP.
The system came up with no disk marked as faulty. Everything seemed all right. All commands worked... strange. I left it running in the hope that the next failure would trigger an automatic failover (which had worked great the previous times).
The server failed again during the night, with very similar symptoms:
- dCache does not deliver files any more
- zpool status -x reports that all pools are healthy!
- zpool status just hangs forever (cannot be killed... utterly lost in kernel space)
- fmadm faulty hangs for a long time before issuing the same error as before the reboot, above (Event-ID df4e42b9-447f-ea75-8b80-d7165084fd40)
- the messages log points to disk /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (see below)
central messages log
# grep t3fs07 messages | sed -e 's/.*scsi: *//' | sort | uniq -c
455 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
51 [ID 107833 kern.warning] WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
455 [ID 365881 kern.info] /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Trying to map the disk that is reported as problematic in the messages log identifies the same disk as indicated by the fmadm commands, c1t3.
hd -w /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
c1t3 = /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
Trying to power off the system:
poweroff
Oct 19 09:29:56 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:29:56 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:29:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:29:57 t3fs07 drive offline
Oct 19 09:29:57 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:29:57 t3fs07 i/o to invalid geometry
Oct 19 09:31:07 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:31:07 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:32:18 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:32:18 t3fs07 Disconnected command timeout for Target 3
I had to shut the system down forcefully through the ILOM with stop -force /SYS.
The system took some time to boot up again. After the reboot, zpool status worked correctly, and I decided to try a manual disk replace.
Trying to manually replace the faulted disk
zpool replace data1 c1t3d0 c6t7d0
# This seemed to start correctly, and I was able to monitor the beginning of the operation as usual
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.05% done, 14h6m to go
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
...
BUT then suddenly, the zpool status blocks again! And again, the commands keep hanging and cannot be terminated even by SIGKILL.
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h2m, 0.46% done, 8h45m to go
*** HANGS - CANNOT BE KILLED ***
# on the console I see
Oct 19 09:52:36 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 09:52:36 t3fs07 SCSI transport failed: reason 'reset': retrying command
Oct 19 09:53:18 t3fs07 scsi: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:53:18 t3fs07 mpt0: unknown event 13 received
Oct 19 09:54:20 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:54:20 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:55:31 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:55:31 t3fs07 Disconnected command timeout for Target 3
Oct 19 09:56:42 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0 (mpt0):
Oct 19 09:56:42 t3fs07 Disconnected command timeout for Target 3
...
Oct 19 10:09:44 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 10:09:44 t3fs07 SCSI transport failed: reason 'reset': giving up
...
Oct 19 12:54:13 t3fs07 scsi: WARNING: /pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0 (sd27):
Oct 19 12:54:13 t3fs07 drive offline
...
Small correction: the zpool status command did terminate on the SIGKILL, after about 3 hours.
Opening the system at runtime shows a yellow LED on the defective disk c1t3d0 (slot 3).
On Wed. Oct 20 I received a replacement disk.
- issued a shutdown of the system
- had to shut the system down forcefully over the ILOM
- exchanged the disk (a yellow LED is still shown after inserting the new disk)
- started /SYS
- system again takes a long time in the initializing phase
Upon startup, after some time I get this on the console:
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Oct 20 14:08:25 CEST 2010
PLATFORM: Sun Fire X4540, CSN: 0949AMR020 , HOSTNAME: t3fs07
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 82b554c1-6136-6b20-ed16-f12378200985
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run 'zpool status -x' and replace the bad device.
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 ONLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
spare DEGRADED 0 0 0
c1t3d0 FAULTED 0 0 0 too many errors
c6t7d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
...
spares
c6t7d0 INUSE currently in use
c6t6d0 AVAIL
c6t5d0 AVAIL
errors: No known data errors
cfgadm -al
Ap_Id Type Receptacle Occupant Condition
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 disk connected configured unknown
c1::dsk/c1t1d0 disk connected configured unknown
c1::dsk/c1t2d0 disk connected configured unknown
c1::dsk/c1t3d0 disk connected configured unknown
...
Making the system aware of the physical replacement of the disk
zpool replace data1 c1t3d0
root@t3fs07 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h1m, 0.15% done, 16h12m to go
config:
...
raidz2 DEGRADED 0 0 0
spare DEGRADED 0 0 48.4K
replacing DEGRADED 0 0 0
c1t3d0s0/o FAULTED 0 0 0 too many errors
c1t3d0 ONLINE 0 0 0 1.00G resilvered
c6t7d0 ONLINE 0 0 0 871M resilvered
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
...
After I had issued this command, it seems that both the spare disk and the new disk are now being resilvered (the system had failed to correctly bring in the spare disk yesterday, while the defective disk was still in place). I hope that this will not lead to further complications.
I was able to stop the unnecessary resilvering of the spare disk by taking it out of the raidset using zpool detach:
zpool detach data1 c6t7d0
root@t3fs07 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.11% done, 7h28m to go
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
...
c6t2d0 ONLINE 0 0 0
raidz2 DEGRADED 0 0 0
replacing DEGRADED 0 0 0
c1t3d0s0/o FAULTED 0 0 0 too many errors
c1t3d0 ONLINE 0 0 0 779M resilvered
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
...
spares
c6t7d0 AVAIL
c6t6d0 AVAIL
c6t5d0 AVAIL
errors: No known data errors
Next morning, the zpool status command shows that a complete resilvering has occurred and everything looks good.
fmadm faulty still shows the same error as above (from Oct 17). The newer failure events concerning this problem (Oct 19, 20) seem to have been correctly cleared, though.
root@t3fs07 $ fmdump
TIME UUID SUNW-MSG-ID
Oct 08 17:38:06.5585 3f41b1d2-d666-c799-fa35-cb4dfa402077 ZFS-8000-D3
Oct 17 22:40:52.5630 df4e42b9-447f-ea75-8b80-d7165084fd40 DISK-8000-0X
Oct 19 20:44:20.6741 67164ab4-ab72-e1fc-c094-bd67f06d7db3 ZFS-8000-FD
Oct 20 14:08:26.1544 82b554c1-6136-6b20-ed16-f12378200985 ZFS-8000-D3
I manually made the fault manager aware of the repair of this old problem
root@t3fs07 $ fmadm repaired "dev:///:devid=id1,sd@n5000c50011234891//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0"
fmadm: recorded repair to dev:///:devid=id1,sd@n5000c50011234891//pci@0,0/pci10de,377@a/pci1000,1000@0/sd@3,0
root@t3fs07 $ fmadm faulty
* no more output of failures *
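The UUID-to-FMRI lookup can be scripted. A hedged sketch, assuming the 'Affects:' line format shown in the fmdump output above:
# Mark an old fault repaired, given its UUID from fmdump.
uuid=df4e42b9-447f-ea75-8b80-d7165084fd40   # example UUID from above
fmri=`fmdump -v -u $uuid | sed -n 's!.*Affects: *\(dev:[^ ]*\).*!\1!p' | head -1`
fmadm repaired "$fmri"
fmadm faulty                                # verify the fault list is now empty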
t3fs11 2010-11-12 disk failure (no automatic failover, scsi timeouts)
- On 2010-11-12 around 17:30h the dCache services on t3fs11 became unresponsive
- SSH login no longer worked
- login through the SP console worked, but a "fmadm faulty" command immediately blocked
SP/logs:
1734 Wed Nov 10 03:28:09 2010 IPMI Log critical
ID = 391 : 11/10/2010 : 03:28:09 : Drive Slot : DBP/HDD40/STATE : Drive
Fault
1727 Wed Oct 20 09:19:47 2010 IPMI Log critical
ID = 38a : 10/20/2010 : 09:19:47 : Drive Slot : DBP/HDD33/STATE : Drive
Fault
THIS IS STRANGE. I was dead sure that I had checked all nodes with "zpool status / zpool status -x" after the last problems. I had not seen the older disk problem!
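To avoid overlooking such faults again, a periodic sweep over all fileservers from an admin host would help. A hedged sketch, assuming passwordless root SSH and the hostnames used on this page:
# Nightly health sweep over the Thor fileservers.
for h in t3fs02 t3fs07 t3fs09 t3fs10 t3fs11; do
    echo "=== $h ==="
    # NB: on a node with a sick mpt controller these commands may hang
    # (see the incidents above), so wrapping them in a timeout would be prudent.
    ssh root@$h 'zpool status -x; fmadm faulty'
done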
On console:
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:38:41 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:38:41 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:38:41 t3fs11 SCSI transport failed: reason 'reset': giving up
Nov 12 17:41:02 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0 (mpt4):
Nov 12 17:41:02 t3fs11 Disconnected command timeout for Target 1
Nov 12 17:41:03 t3fs11 scsi: WARNING: /pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0 (sd33):
Nov 12 17:41:03 t3fs11 SCSI transport failed: reason 'reset': giving up
I forcefully rebooted the system. It took a long time in the initialization phase.
After the OS was up, everything looked more or less ok, and a zpool status yielded
root@t3fs11 $ date
Fri Nov 12 17:56:42 CET 2010
root@t3fs11 $ zpool status -x
all pools are healthy
But an fmadm check reveals the problems!
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 10 02:28:42 f06347f4-ced1-ccc9-de60-bb0c727e17aa DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019d0b756//pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@0,0
faulted but still in service
FRU : "HD_ID_40" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5RC85:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=40/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 20 09:20:10 5eb5618e-eb59-e6e4-852d-9c035d56d620 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019bccc74//pci@3c,0/pci10de,375@b/pci1000,1000@0/sd@1,0
faulted but still in service
FRU : "HD_ID_33" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5P7HJ:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=33/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
Let's read out the SMART values manually:
root@t3fs11 $ hd -R
0 c1t0 71156486 0 29 3 4333649840 6802 0 29 0 0 0 0 589496355 35 0 21 71156486 0 0 0
1 c1t1 156722990 0 29 0 8629727596 6802 0 29 0 0 0 0 656801830 38 0 23 156722990 0 0 0
2 c1t2 82792345 0 29 1 12923635461 6802 0 29 0 0 0 0 673710120 40 0 22 82792345 0 0 0
3 c1t3 131980351 0 28 1 38787307 6821 0 28 0 0 0 0 741015596 44 0 23 131980351 0 0 0
4 c1t4 2105829 0 29 0 39140233 6802 0 29 0 0 0 0 589496355 35 0 21 2105829 0 0 0
5 c1t5 238257017 0 29 0 8629173752 6802 0 29 0 1 0 0 623181861 37 0 21 238257017 0 0 0
6 c1t6 113587266 0 29 0 39442020 6802 0 29 0 0 0 0 640024614 38 0 21 113587266 0 0 0
7 c1t7 4408992 0 28 0 40359853 6821 0 28 0 0 0 0 707264553 41 0 22 4408992 0 0 0
8 c2t0 162455847 0 30 2 4336062036 6802 0 30 0 0 0 0 589430819 35 0 21 162455847 0 0 0
9 c2t1 145513360 0 30 0 4336311849 6802 0 30 0 0 0 0 623181861 37 0 22 145513360 0 0 0
10 c2t2 226809752 0 30 16 41208154 6802 0 30 0 0 0 0 656801831 39 0 21 226809752 0 0 0
11 c2t3 174125163 0 29 0 4338248677 6821 0 29 0 0 0 0 656867367 39 0 21 174125163 0 0 0
12 c2t4 219429979 0 30 1 40478020 6801 0 30 0 0 0 0 589496355 35 0 21 219429979 0 0 0
13 c2t5 197134363 0 30 0 4335708953 6802 0 30 0 0 0 0 623116325 37 0 21 197134363 0 0 0
14 c2t6 173069853 0 30 1 12925996326 6802 0 30 0 0 0 0 640024614 38 0 21 173069853 0 0 0
15 c2t7 170349382 0 30 1 8633168282 6802 0 30 0 0 0 0 690487337 41 0 22 170349382 0 0 0
16 c3t0 231005259 0 29 0 40836922 6802 0 29 0 0 0 0 572653602 34 0 21 231005259 0 0 0
17 c3t1 185048562 0 29 1 25810826414 6802 0 29 0 0 0 0 640024614 38 0 23 185048562 0 0 0
18 c3t2 153398438 0 30 3 41376728 6802 0 30 0 0 0 0 673644584 40 0 23 153398438 0 0 0
19 c3t3 184747717 0 29 0 43791532 6852 0 29 0 0 0 0 673644584 40 0 21 184747717 0 0 0
20 c3t4 129781851 0 30 10 40624346 6802 0 30 0 0 0 0 606273572 36 0 22 129781851 0 0 0
21 c3t5 90664177 0 30 0 8630917997 6802 0 30 0 0 0 0 623116325 37 0 21 90664177 0 0 0
22 c3t6 149266517 0 30 2 506843827975 6802 0 30 0 0 0 0 656801831 39 0 21 149266517 0 0 0
23 c3t7 87340628 0 29 1 8633518363 6822 0 29 0 0 0 0 656801831 39 0 21 87340628 0 0 0
24 c4t0 189545523 0 30 0 4335705428 6802 0 30 0 0 0 0 623116325 37 0 23 189545523 0 0 0
25 c4t1 170974309 0 30 0 41051470 6802 0 30 0 0 0 0 639959078 38 0 21 170974309 0 0 0
26 c4t2 206835020 0 30 1 8630880522 6802 0 30 0 0 0 0 656801831 39 0 22 206835020 0 0 0
27 c4t3 56131886 0 30 8 43358185 6802 0 30 0 0 0 0 673644584 40 0 22 56131886 0 0 0
28 c4t4 47595425 0 30 2 40986081 6802 0 30 0 0 0 0 606273572 36 0 21 47595425 0 0 0
29 c4t5 111191905 0 30 0 41100684 6802 0 30 0 0 0 0 639959078 38 0 23 111191905 0 0 0
30 c4t6 118020682 0 30 0 4335644549 6802 0 30 0 0 0 0 623181861 37 0 20 118020682 0 0 0
31 c4t7 48975824 0 30 0 4337809471 6802 0 30 0 0 0 0 690487337 41 0 23 48975824 0 0 0
32 c5t0 164236644 0 30 0 40837732 6802 0 30 0 0 0 0 623116325 37 0 22 164236644 0 0 0
33 c5t1 164149755 0 30 2047 39861850 6802 0 30 0 20 0 0 639959078 38 0 21 164149755 0 0 0
34 c5t2 222857622 0 30 0 41069638 6802 0 30 0 0 0 0 656801831 39 0 21 222857622 0 0 0
35 c5t3 26578537 0 30 63 12928258145 6802 0 30 0 0 0 0 690487337 41 0 22 26578537 0 0 0
36 c5t4 47674425 0 30 0 4335792581 6801 0 30 0 0 0 0 606273572 36 0 22 47674425 0 0 0
37 c5t5 49987836 0 30 0 4335657030 6802 0 30 0 0 0 0 673644584 40 0 24 49987836 0 0 0
38 c5t6 131060750 0 29 0 41090773 6848 0 29 0 0 0 0 656801831 39 0 21 131060750 0 0 0
39 c5t7 259680 0 30 4 12928404237 6802 0 30 0 0 0 0 673644584 40 0 22 259680 0 0 0
40 c6t0 33501316 0 30 1891 8631534340 6801 0 30 0 9 0 0 589430819 35 0 21 33501316 1 1 0
41 c6t1 88093475 0 30 0 4336384462 6802 0 30 0 0 0 0 639959078 38 0 22 88093475 0 0 0
42 c6t2 12197846 0 30 2 8631181874 6802 0 30 0 0 0 0 673644584 40 0 23 12197846 0 0 0
43 c6t3 120037422 0 29 0 43593096 6821 0 29 0 0 0 0 724107307 43 0 23 120037422 0 0 0
44 c6t4 173604553 0 30 0 40575636 6801 0 30 0 0 0 0 606273572 36 0 22 173604553 0 0 0
45 c6t5 122179284 0 29 0 17139749 6854 0 29 0 0 0 0 673579048 40 0 23 122179284 0 0 0
46 c6t6 203753234 0 30 4 30082236234 6802 0 30 0 0 0 0 673579048 40 0 22 203753234 0 0 0
47 c6t7 150210603 0 28 0 4313753034 6847 0 29 0 0 0 0 690487337 41 0 21 150210603 0 0 0
The unnaturally high values in the 6th column for both of these disks refer to the Reallocated sector count SMART value. Also, the Uncorrectable Errors for Host values are greater than zero (20 and 9) for these two disks.
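Picking those disks out of the hd -R listing can be automated. A hedged awk filter, assuming the column layout shown above, where the reallocated sector count is the 6th whitespace-separated field; the threshold of 100 is arbitrary:
# Flag disks with an unnaturally high reallocated sector count.
hd -R | awk '$6 > 100 { print "slot " $1 ": " $2 " reallocated sectors: " $6 }'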
Mappings:
| HD33 | c5t1 | c5::dsk/c5t1d0 |
| HD40 | c6t0 | c6::dsk/c6t0d0 |
The pool setup for this zpool:
root@t3fs11 $ zpool history
History for 'data1':
2010-08-17.15:44:09 zpool create -f data1 raidz2 c1t0d0 c1t5d0 c2t2d0 c2t7d0 c3t4d0 c4t1d0 c4t6d0 c5t3d0 c6t0d0
2010-08-17.15:44:15 zpool add -f data1 raidz2 c1t1d0 c1t6d0 c2t3d0 c3t0d0 c3t5d0 c4t2d0 c4t7d0 c5t4d0 c6t1d0
2010-08-17.15:44:20 zpool add -f data1 raidz2 c1t2d0 c1t7d0 c2t4d0 c3t1d0 c3t6d0 c4t3d0 c5t0d0 c5t5d0 c6t2d0
2010-08-17.15:44:25 zpool add -f data1 raidz2 c1t3d0 c2t0d0 c2t5d0 c3t2d0 c3t7d0 c4t4d0 c5t1d0 c5t6d0 c6t3d0
2010-08-17.15:44:30 zpool add -f data1 raidz2 c1t4d0 c2t1d0 c2t6d0 c3t3d0 c4t0d0 c4t5d0 c5t2d0 c5t7d0 c6t4d0
2010-08-17.15:44:33 zpool add -f data1 spare c6t7d0 c6t6d0 c6t5d0
Since I am expecting potential problems with the SCSI communications (as in the previous problems, above), I want to remove these disks as completely as possible from the active system.
root@t3fs11 $ zpool offline data1 c5t1d0
root@t3fs11 $ zpool offline data1 c6t0d0
root@t3fs11 $ cfgadm -c unconfigure c5::dsk/c5t1d0
root@t3fs11 $ cfgadm -c unconfigure c6::dsk/c6t0d0
root@t3fs11 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 OFFLINE 0 0 0
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 OFFLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
...
Ok... replacing the first broken disk with one of the spare disks
root@t3fs11 $ zpool replace data1 c5t1d0 c6t7d0
The resilver started well as observed by zpool status... I can see no SCSI errors on the console over several minutes.
Getting a bit more daring... I try to resilver the second disk in parallel. The disks are in different RAID sets, so this should not hurt too much.
root@t3fs11 $ zpool replace data1 c6t0d0 c6t6d0
Seems to work.... about 8 hours to go. I will not run dcache on these nodes during that time... let's not push our luck.
The resilvering operation terminated successfully after 11 hours.
I exchanged both disks physically on 2010-11-17 15:30. Both were correctly marked by blue LEDs.
root@t3fs11 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
see: http://www.sun.com/msg/ZFS-8000-2Q
scrub: resilver completed after 11h19m with 0 errors on Sat Nov 13 06:01:03 2010
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 0
raidz2 DEGRADED 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.8M
c6t0d0 UNAVAIL 0 0 0 cannot open
c6t6d0 ONLINE 0 0 0 577G resilvered
raidz2 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c5t4d0 ONLINE 0 0 0
c6t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t0d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.5M
c5t1d0 UNAVAIL 0 0 0 cannot open
c6t7d0 ONLINE 0 0 0 578G resilvered
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t7d0 INUSE currently in use
c6t6d0 INUSE currently in use
c6t5d0 AVAIL
errors: No known data errors
root@t3fs11 $ cfgadm -c configure c5::dsk/c5t1d0
root@t3fs11 $ zpool online data1 c5t1d0
warning: device 'c5t1d0' onlined, but remains in faulted state
use 'zpool replace' to replace devices that are no longer present
Still, the disk remains in unavailable state when querying zpool status. Let's follow the message and issue a replace
root@t3fs11 $ zpool replace data1 c5t1d0
root@t3fs11 $ zpool status -x
...
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
spare DEGRADED 0 0 31.5M
replacing DEGRADED 0 0 0
c5t1d0s0/o FAULTED 0 0 0 corrupted data
c5t1d0 ONLINE 0 0 0 526M resilvered
c6t7d0 ONLINE 0 0 0 8K resilvered
c5t6d0 ONLINE 0 0 0
...
Same procedure for the second disk:
root@t3fs11 $ cfgadm -c configure c6::dsk/c6t0d0
root@t3fs11 $ zpool online data1 c6t0d0
t3fs10 2010-11-22 drive failure (no automatic failover, scsi timeouts)
The zpool command blocks
zpool status -x
Fault manager
fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Nov 22 08:00:11 4913b692-b29e-682a-f09a-9491300fb237 DISK-8000-0X Major
Host : t3fs10
Platform : Sun Fire X4540 Chassis_id : 0949AMR021
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c19f38//pci@0,0/pci10de,376@f/pci1000,1000@0/sd@7,0
faulted but still in service
FRU : "HD_ID_23" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR021:server-id=t3fs10:serial=9QJ5QJ4T:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=23/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
705 Mon Nov 22 08:58:07 2010 IPMI Log critical
ID = 17b : 11/22/2010 : 08:58:07 : Drive Slot : DBP/HDD23/STATE : Drive
Fault
In the central messages log:
Nov 22 08:00:11 t3fs10.psi.ch fmd: [ID 441519 daemon.error] SUNW-MSG-ID: DISK-8000-0X, TYPE: Fault, VER: 1, SEVERITY: Major
Nov 22 08:00:11 t3fs10.psi.ch EVENT-TIME: Mon Nov 22 08:00:11 CET 2010
Nov 22 08:00:11 t3fs10.psi.ch PLATFORM: Sun Fire X4540, CSN: 0949AMR021 , HOSTNAME: t3fs10
Nov 22 08:00:11 t3fs10.psi.ch SOURCE: eft, REV: 1.16
Nov 22 08:00:11 t3fs10.psi.ch EVENT-ID: 4913b692-b29e-682a-f09a-9491300fb237
On the console:
root@t3fs10 $ Nov 22 15:23:14 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Nov 22 15:23:14 t3fs10 Disconnected command timeout for Target 7
Nov 22 15:25:05 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Nov 22 15:25:05 t3fs10 Disconnected command timeout for Target 7
Nov 22 15:26:16 t3fs10 scsi: WARNING: /pci@0,0/pci10de,376@f/pci1000,1000@0 (mpt2):
Did a forced shutdown and reboot. Afterwards the zpool command no longer blocked and I was able to get more diagnostics again:
root@t3fs10 $ hd -R
0 c1t0 125535798 0 19 0 12960759222 7049 0 19 0 0 0 1 454754331 27 0 20 125535798 0 0 0
1 c1t1 237015775 0 19 0 47328169527 7050 0 19 0 0 0 0 505217054 29 0 20 237015775 0 0 0
2 c1t2 182514992 0 19 1 4378448541 7050 0 19 0 0 0 0 522059807 31 0 19 182514992 0 0 0
3 c1t3 164619320 0 19 0 107455757220 7049 0 19 0 0 0 0 538968096 32 0 20 164619320 0 0 0
4 c1t4 164321560 0 19 0 141818656169 7050 0 19 0 0 0 0 454688795 27 0 20 164321560 0 0 0
5 c1t5 219147005 0 19 0 60206395946 7050 0 19 0 0 0 0 488439837 29 0 20 219147005 0 0 0
6 c1t6 83774341 0 19 3 8674093422 7050 0 19 0 0 0 0 505217054 30 0 19 83774341 0 0 0
7 c1t7 82812396 0 19 0 74561503 7049 0 19 0 0 0 0 522125343 31 0 19 82812396 0 0 0
8 c2t0 18252396 0 19 0 21556449891 7050 0 19 0 0 0 0 437846042 26 0 20 18252396 0 0 0
9 c2t1 59917419 0 19 1 85251527 7049 0 19 0 0 0 0 488374301 29 0 20 59917419 0 0 0
10 c2t2 106273402 0 19 0 34435821549 7049 0 19 0 0 0 0 488439837 29 0 19 106273402 0 0 0
11 c2t3 58113178 0 19 0 4379533373 7049 0 19 0 0 0 0 555810849 33 0 21 58113178 0 0 0
12 c2t4 111165675 0 19 0 189062989066 7049 0 19 0 0 0 0 437846042 26 0 20 111165675 0 0 0
13 c2t5 146995884 0 19 0 8671816057 7049 0 19 0 0 0 0 454754331 27 0 20 146995884 0 0 0
14 c2t6 191086390 0 19 0 4380041942 7049 0 19 0 0 0 0 505217053 29 0 20 191086390 0 0 0
15 c2t7 162089547 0 19 0 17255685134 7050 0 19 0 0 0 0 538968096 32 0 21 162089547 0 0 0
16 c3t0 176979997 0 19 0 34443372067 7049 0 19 0 0 0 0 421003289 24 0 20 176979997 0 0 0
17 c3t1 5548231 0 19 0 84007444 7049 0 19 0 0 0 0 454754331 27 0 20 5548231 0 0 0
18 c3t2 173448060 0 19 0 8671492950 7049 0 19 0 0 0 0 471597084 28 0 20 173448060 0 0 0
19 c3t3 4924500 0 19 0 773178340077 7049 0 19 0 0 0 0 488439837 29 0 19 4924500 0 0 0
20 c3t4 44076624 0 19 0 21551476816 7049 0 19 0 0 0 0 421003289 24 0 19 44076624 0 0 0
21 c3t5 225718031 0 19 0 83790115 7049 0 19 0 0 0 0 437911578 26 0 19 225718031 0 0 0
22 c3t6 175558069 0 19 0 38738911150 7049 0 19 0 0 0 0 488439837 29 0 20 175558069 0 0 0
23 c3t7 26969752 0 19 2005 81773334 7049 0 19 0 885 5 0 505282590 30 0 20 26969752 42 42 0
24 c4t0 9005087 0 19 0 12969472234 7049 0 19 0 0 0 0 421003289 25 0 20 9005087 0 0 0
25 c4t1 16782365 0 18 0 8666295329 7071 0 18 0 0 0 0 454688795 27 0 20 16782365 0 0 0
26 c4t2 24234100 0 19 0 47328536916 7049 0 19 0 0 0 0 488439837 29 0 20 24234100 0 0 0
27 c4t3 62181622 0 19 0 154703494002 7049 0 19 0 0 0 0 505282590 30 0 20 62181622 0 0 0
28 c4t4 91302033 0 19 0 21556859439 7049 0 19 0 0 0 0 421003289 25 0 20 91302033 0 0 0
29 c4t5 47181631 0 19 0 17265235415 7049 0 19 0 0 0 0 437911578 26 0 19 47181631 0 0 0
30 c4t6 196503721 0 19 2 4370861630 7049 0 19 0 0 0 0 471597084 28 0 20 196503721 0 0 0
31 c4t7 8413596 0 19 5 4378336008 7049 0 19 0 0 0 0 505282590 30 0 20 8413596 0 0 0
32 c5t0 189146743 0 18 3 12969013889 7049 0 18 0 0 0 0 404226072 24 0 20 189146743 0 0 0
33 c5t1 180453604 0 18 0 81297831 7049 0 18 0 0 0 0 454688795 27 0 19 180453604 0 0 0
34 c5t2 129281214 0 18 0 4379814706 7049 0 18 0 0 0 0 471597084 28 0 19 129281214 0 0 0
35 c5t3 45659498 0 18 1 4371302900 7049 0 18 0 0 0 0 488374301 29 0 19 45659498 0 0 0
36 c5t4 122992578 0 18 0 4378594187 7049 0 18 0 0 0 0 421003289 25 0 20 122992578 0 0 0
37 c5t5 201472395 0 18 1 38738707175 7049 0 18 0 0 0 0 454754331 27 0 20 201472395 0 0 0
38 c5t6 138606984 0 18 0 8671822466 7049 0 18 0 0 0 0 488439837 29 0 20 138606984 0 0 0
39 c5t7 116268514 0 18 0 8674426890 7049 0 18 0 0 0 0 505282590 30 0 20 116268514 0 0 0
40 c6t0 55061066 0 18 0 60204989916 7049 0 18 0 0 0 0 421068825 25 0 20 55061066 0 0 0
41 c6t1 10142376 0 18 1 429580868094 7049 0 18 0 0 0 0 454754331 27 0 20 10142376 0 0 0
42 c6t2 81596927 0 18 0 12969434046 7049 0 18 0 0 0 0 471531547 27 0 19 81596927 0 0 0
43 c6t3 35092129 0 18 1 4376786413 7049 0 18 0 0 0 0 488374301 29 0 20 35092129 0 0 0
44 c6t4 57483251 0 18 1 30150093388 7049 0 18 0 0 0 0 421003289 25 0 20 57483251 0 0 0
45 c6t5 170939821 0 18 918 34375768257 7049 0 18 0 0 0 0 471597084 28 0 21 170939821 3 3 0
46 c6t6 168553543 0 18 0 15841561 7049 0 18 0 0 0 0 471531548 28 0 19 168553543 0 0 0
47 c6t7 171788766 0 18 0 4310892064 7049 0 18 0 0 0 0 505282590 30 0 20 171788766 0 0 0
Mappings:
| HD23 | c3t7 | c3::dsk/c3t7d0 |
zpool offline data1 c3t7d0
cfgadm -c unconfigure c3::dsk/c3t7d0
zpool replace data1 c3t7d0 c6t7d0
The spare disk seems to be resilvering correctly. I will start file services and dCache again.
root@t3fs10 $ zpool status
pool: data1
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Online the device using 'zpool online' or replace the device with
'zpool replace'.
scrub: resilver in progress for 0h4m, 0.55% done, 12h2m to go
config:
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
spare DEGRADED 0 0 226K
c3t7d0 OFFLINE 0 0 0
c6t7d0 ONLINE 0 0 0 3.98G resilvered
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
....
# Later
scrub: resilver completed after 12h39m with 0 errors on Tue Nov 23 04:41:01 2010
List of occurrences in 2011:
2011-01-10 t3fs10 drive failure (of an unused spare disk)
The node was marked by a yellow service LED, the fmadm fault manager, and the ILOM log, but zpool status -x shows that all pools are healthy. The fault is flagged as an imminent failure due to SMART monitoring.
root@t3fs10 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Dec 14 09:34:24 b6f7ad51-5cfc-4f2b-bab0-a57d939eadde DISK-8000-0X Major
Host : t3fs10
Platform : Sun Fire X4540 Chassis_id : 0949AMR021
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019c55a36//pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
faulted but still in service
FRU : "HD_ID_45" (hc://:product-id=Sun-Fire-X4540:chassis-id=0949AMR021:server-id=t3fs10:serial=9QJ5R4WX:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=45/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
748 Mon Jan 10 11:28:48 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
747 Tue Dec 14 09:34:11 2010 IPMI Log critical
ID = 195 : 12/14/2010 : 09:34:11 : Drive Slot : DBP/HDD45/STATE : Drive
Fault
746 Wed Dec 1 15:49:28 2010 IPMI Log critical
ID = 194 : 12/01/2010 : 15:49:28 : Drive Slot : DBP/HDD23/STATE : Drive
Fault (PREVIOUS FAULT... ALREADY FIXED)
SMART values show a high Reallocated sector count:
hd -R
...
42 c6t2 79087142 0 18 0 12978519190 8221 0 18 0 0 0 0 538640414 30 0 19 79087142 0 0 0
43 c6t3 81712106 0 18 1 4386173981 8221 0 18 0 0 0 0 555483167 31 0 20 81712106 0 0 0
44 c6t4 106459455 0 18 1 34454857772 8221 0 18 0 0 0 0 471334939 27 0 20 106459455 0 0 0
45 c6t5 171469164 0 18 1739 34378481112 8221 0 18 0 0 0 0 538705951 31 0 21 171469164 4 4 0
46 c6t6 169113078 0 18 0 18445266 8221 0 18 0 0 0 0 538640414 30 0 19 169113078 0 0 0
47 c6t7 126977783 0 18 0 4318805528 8221 0 18 0 0 0 0 572391456 33 0 20 126977783 0 0 0
Mapping to the disk name:
root@t3fs10 $ hd -w /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
c6t5 = /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@5,0
Oddly enough, this disk is actually a spare disk of the configuration. I do not understand how the reallocated sector count problem was able to arise there.
root@t3fs10 $ zpool offline data1 c6t5d0
cannot offline c6t5d0: device is reserved as a hot spare
root@t3fs10 $ cfgadm -c unconfigure c6::dsk/c6t5d0
cfgadm: Hardware specific failure: failed to unconfigure SCSI device: Device busy
root@t3fs10 $ zpool detach data1 c6t5d0
cannot detach c6t5d0: device is reserved as a hot spare
root@t3fs10 $ zpool remove data1 c6t5d0 # THAT ONE WORKS!!!!!
cfgadm -c unconfigure c6::dsk/c6t5d0
Replaced the disk, then brought it back into the configuration:
cfgadm -c configure c6::dsk/c6t5d0
zpool add -f data1 spare c6t5d0
fmadm faulty shows no problems any more.
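For reference, the spare-disk case condensed into one hedged sketch (disk c6t5d0 as in this example): offline and detach refuse to act on an unused hot spare, so the disk has to be removed from the pool instead.
# Replacing an unused hot spare in pool data1.
zpool remove data1 c6t5d0          # offline/detach fail for spares; remove works
cfgadm -c unconfigure c6::dsk/c6t5d0
# ... physically swap the disk ...
cfgadm -c configure c6::dsk/c6t5d0
zpool add -f data1 spare c6t5d0
fmadm faulty                       # should no longer list the fault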
2011-01-25 t3fs11 drive failure (no automatic failover, scsi timeouts)
zpool status hangs. fmadm faulty takes a long time (minutes) to return.
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jan 25 14:58:34 145d27a2-93c2-e3d0-8719-c063b734b1a9 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019ce2eb9//pci@0,0/pci10de,375@b/pci1000,1000@0/sd@6,0
faulted but still in service
FRU : "HD_ID_14" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5T450:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=14/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u to identify the disk.
SP logs. Note that the two older failures have already been repaired (see above). But it shows that there were no hints of anything like temperature problems between those failures and now.
ID Date/Time Class Type Severity
----- ------------------------ -------- -------- --------
1792 Wed Jan 26 14:11:36 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
1791 Tue Jan 25 14:57:58 2011 IPMI Log critical
ID = 3b4 : 01/25/2011 : 14:57:58 : Drive Slot : DBP/HDD14/STATE : Drive
Fault
1790 Thu Nov 18 17:51:08 2010 IPMI Log critical
ID = 3b3 : 11/18/2010 : 17:51:08 : Drive Slot : DBP/HDD40/STATE : Drive
Fault
1789 Thu Nov 18 17:51:07 2010 IPMI Log critical
ID = 3b2 : 11/18/2010 : 17:51:07 : Drive Slot : DBP/HDD33/STATE : Drive
Fault
On the system console
Jan 26 14:29:29 t3fs11 Disconnected command timeout for Target 6
Jan 26 14:29:30 t3fs11 scsi: WARNING: /pci@0,0/pci10de,375@b/pci1000,1000@0/sd@6,0 (sd22):
Jan 26 14:29:30 t3fs11 SCSI transport failed: reason 'reset': giving up
I had to shut down the system forcefully through the ILOM.
The system came up again in an apparently healthy state:
root@t3fs11 $ zpool status -x
all pools are healthy
Manually checking the SMART table reveals a high Reallocated sector count for this disk as well:
hd -R
...
11 c2t3 113880158 0 30 0 4355539612 8618 0 30 0 0 0 0 538968096 32 0 21 113880158 0 0 0
12 c2t4 47366463 0 31 1 111728827925 8598 0 31 0 0 0 0 454688795 27 0 21 47366463 0 0 0
13 c2t5 72532704 0 31 0 8647554845 8598 0 31 0 0 0 0 488374301 28 0 21 72532704 0 0 0
14 c2t6 214809788 0 31 2047 17239657814 8598 0 31 0 5028 17180131357 0 522059807 31 0 21 214809788 0 0 0
15 c2t7 46182055 0 31 1 8650188048 8598 0 31 0 0 0 0 555745313 33 0 22 46182055 0 0 0
16 c3t0 235541355 0 30 1 4353410746 8598 0 30 0 0 0 0 437846042 26 0 21 235541355 0 0 0
...
Initiating a manual replacement of the disk in the zpool configuration:
root@t3fs11 $ zpool replace data1 c2t6d0 c6t7d0
root@t3fs11 $ zpool status -x
pool: data1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 0h0m, 0.11% done, 8h25m to go
config:
NAME STATE READ WRITE CKSUM
...
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
spare ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0 523M resilvered
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t7d0 INUSE currently in use
c6t6d0 AVAIL
c6t5d0 AVAIL
Turned dCache on again.
A few minutes later the system blocked again. Again a forced shutdown, and it came up fine. It seems that the OS still tries to communicate with the broken disk, even though it is being replaced!
I tried to offline and unconfigure the disk to prevent any communication. I also kept the machine out of dCache, so that no I/O at all, except for the resilvering, would take place.
root@t3fs11 $ zpool offline data1 c2t6d0
root@t3fs11 $ cfgadm -c unconfigure c2::dsk/c2t6d0
This worked, albeit it took 2h more than the projected time: 8.5 hours, lasting into the night. The resilver looks ok.
I started dCache again the following morning, and the system seems to run stably.
cfgadm -c configure c2::dsk/c2t6d0
zpool replace data1 c2t6d0
2011-02-22 t3fs11 imminent drive failure (no automatic failover yet)
ILOM recorded an imminent drive failure and sent mail.
Gather information:
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Feb 22 14:49:40 7a731175-c212-4def-d3f7-de73f2db6441 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019e13105//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@3,0
faulted but still in service
FRU : "HD_ID_27" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u <EVENT-ID> to identify the disk.
root@t3fs11 $ fmdump -v -u 7a731175-c212-4def-d3f7-de73f2db6441
TIME UUID SUNW-MSG-ID
Feb 22 14:49:41.1532 7a731175-c212-4def-d3f7-de73f2db6441 DISK-8000-0X
100% fault.io.disk.predictive-failure
Problem in: hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0
Affects: dev:///:devid=id1,sd@n5000c50019e13105//pci@3c,0/pci10de,377@a/pci1000,1000@0/sd@3,0
FRU: hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5XGFF:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=27/disk=0
Location: HD_ID_27
From the hd utility we can see that slot 27 maps to c4t3.
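The mapping can also be cross-checked via the disk serial from the FRU string (9QJ5XGFF), assuming the SUNWhd hd tool supports the -c -s inventory flags used further down this page:
root@t3fs11 $ hd -c -s | grep 9QJ5XGFF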
What does ZFS know?
root@t3fs11 $ zpool status -x
all pools are healthy
root@t3fs11 $ zpool status data1
# this command still works perfectly. ZFS is not yet aware that something is wrong.
Let's investigate the SMART tables:
root@t3fs11 $ hd -R
...
23 c3t7 90131225 0 31 2 8655799187 9269 0 31 0 0 0 0 555548704 32 0 21 90131225 0 0 0
24 c4t0 84145097 0 32 2 8654720307 9249 0 32 0 0 0 0 505020445 29 0 23 84145097 0 0 0
25 c4t1 134640711 0 32 0 63177304 9249 0 32 0 0 0 0 521797662 30 0 21 134640711 0 0 0
26 c4t2 143683009 0 32 1 8653043032 9249 0 32 0 0 0 0 555483168 32 0 22 143683009 0 0 0
27 c4t3 172966231 0 32 2043 67914384 9249 0 32 0 130 0 0 572391457 33 0 22 172966231 4 4 0
28 c4t4 2460622 0 32 3 63332723 9249 0 32 0 0 0 0 488112155 27 0 21 2460622 0 0 0
29 c4t5 155198022 0 32 0 65310109 9249 0 32 0 0 0 0 521863198 30 0 23 155198022 0 0 0
...
Information from smartmontools:
root@t3fs11 $ smartctl -a /dev/rdsk/c4t3d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094555XGFF
Serial Number: 9QJ5XGFF
Firmware Version: SU0E
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Wed Feb 23 14:07:04 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
...
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
...
5 Reallocated_Sector_Ct 0x0033 001 001 036 Pre-fail Always FAILING_NOW 2043
...
...
OK. We want to take this disk out of the active pool and immediately substitute it with one of the spares.
We need to get the name mapping for the unconfigure command:
root@t3fs11 $ cfgadm -a
Ap_Id Type Receptacle Occupant Condition
c1 scsi-bus connected configured unknown
c1::dsk/c1t0d0 disk connected configured unknown
...
c4::dsk/c4t3d0 disk connected configured unknown
...
We offline the disk and prevent the system from further interacting with it:
root@t3fs11 $ zpool offline data1 c4t3d0
root@t3fs11 $ cfgadm -c unconfigure c4::dsk/c4t3d0
We initiate the resilvering onto a spare:
root@t3fs11 $ zpool replace data1 c4t3d0 c6t6d0
2011-03-12 t3fs07 disk c2t0d0 failure
ILOM recorded a drive failure and sent an e-mail.
ID = 1ce : 03/12/2011 : 02:29:05 : Drive Slot : DBP/HDD8/STATE : Drive Fault
smartd also reported:
Device: /dev/rdsk/c2t0d0, FAILED SMART self-check. BACK UP DATA NOW!
Information from smartmontools:
root@t3fs07 $ smartctl -a /dev/rdsk/c2t0d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094455T8NT
Serial Number: 9QJ5T8NT
Firmware Version: SU0E
User Capacity: 1,000,204,886,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sun Mar 13 11:15:35 2011 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
...
What does ZFS know?
root@t3fs07 $ zpool status -x
pool: data1
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scrub: resilver completed after 9h42m with 0 errors on Sat Mar 12 11:04:51 2011
...
spare DEGRADED 0 0 28.4M
c2t0d0 REMOVED 0 0 0
c6t5d0 ONLINE 0 0 0 515G resilvered
Remove the disk from ZFS:
root@t3fs07 $ zpool offline data1 c2t0d0
It produces a transition from REMOVED => OFFLINE:
root@t3fs07 $ zpool status -x|grep c2t0d0
c2t0d0 OFFLINE 0 0 0
Remove the disk from Solaris:
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::dsk/c2t0d0 disk connected configured unknown
root@t3fs07 $ cfgadm -c unconfigure c2::dsk/c2t0d0
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::rdsk/c2t0d0 disk connected unconfigured unknown
We replaced the disk with a spare one; after the physical change we got a state transition to 'configured':
root@t3fs07 $ cfgadm -a | grep c2t0d0
c2::dsk/c2t0d0 disk connected configured unknown
and in ZFS a transition OFFLINE => REMOVED:
root@t3fs07 $ zpool status | grep c2t0d0
c2t0d0 REMOVED 0 0 0
Replace the hot spare with the new disk:
root@t3fs07 $ zpool replace data1 c2t0d0
root@t3fs07 $ zpool status
...
scrub: resilver in progress for 0h0m, 0.16% done, 10h5m to go
...
raidz2 DEGRADED 0 0 0
c1t3d0 ONLINE 0 0 0
spare DEGRADED 0 0 28.4M
replacing DEGRADED 0 0 0
c2t0d0s0/o FAULTED 0 0 0 corrupted data
c2t0d0 ONLINE 0 0 0 984M resilvered
c6t5d0 ONLINE 0 0 0 94K resilvered
...
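Once the resilver completes, the hot spare c6t5d0 becomes redundant and can be detached so that it returns to the AVAIL spare list (the same detach procedure as used below):
root@t3fs07 $ zpool detach data1 c6t5d0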
We opened case SR 3-3171818131 with Oracle about the broken disk SN 9QJ5T8NT
(ST31000340NS) on system SUN FIRE X4540 SN 0949AMR020; it was replaced with disk SN 9QJ5KV96
(ST31000340NS).
2011-03-13 t3fs07 disk c2t6d0 proactive maintenance
Disk c2t6d0 was found to have plenty of errors, though it had not failed yet; ZFS replaced it with the spare disk c6t6d0:
root@t3fs07 $ smartctl -a /dev/rdsk/c2t6d0
smartctl version 5.38 [i386-pc-solaris2.8] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: SEAGATE ST31000NSSUN1.0T 094455V7M5
Serial Number: 9QJ5V7M5
...
root@t3fs07 $ zpool status
spare DEGRADED 0 0 620K
c2t6d0 FAULTED 28 0 0 too many errors
c6t6d0 ONLINE 0 0 0
We got SMARTd e-mails like:
Device: /dev/rdsk/c2t6d0, 332 Offline uncorrectable sectors
So we decided to replace it with a new disk, SN 9QJ5R6W5; we ran:
root@t3fs07 $ cfgadm -a | grep c2t6d0
c2::dsk/c2t6d0 disk connected configured unknown
root@t3fs07 $ cfgadm -c unconfigure c2::dsk/c2t6d0
root@t3fs07 $ cfgadm -a | grep c2t6d0
c2::rdsk/c2t6d0 disk connected unconfigured unknown
and suddenly we got this ILOM e-mail:
ID = 1d3 : 03/13/2011 : 18:33:42 : Drive Slot : DBP/HDD14/STATE : Hot Spare
After the disk change we ran:
root@t3fs07 $ zpool replace data1 c2t6d0
and zpool status reports:
...
spare DEGRADED 0 0 620K
replacing DEGRADED 0 0 0
c2t6d0s0/o FAULTED 28 0 0 too many errors
c2t6d0 ONLINE 0 0 0 1.46G resilvered
c6t6d0 ONLINE 0 0 0 107K resilvered
c3t3d0 ONLINE 0 0 0
...
We again updated Oracle case SR 3-3171818131.
2011-03-14 t3fs07 Spare disk stuck in an apparently healthy raidz2 vdev; needed to manually remove it
For some strange reason a spare was still attached to the first raidz vdev of the pool. Both the spare and the disk it had replaced looked good and seemed to serve the same purpose in the raidz. We tried to put the spare back into the unused spare list.
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
scrub: resilver completed after 7h37m with 0 errors on Mon Mar 14 01:23:01 2011
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
spare ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
...
Detaching the spare helped:
root@t3fs07 $ zpool detach data1 c6t7d0
root@t3fs07 $ zpool status data1
pool: data1
state: ONLINE
scrub: resilver completed after 7h37m with 0 errors on Mon Mar 14 01:23:01 2011
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
2011-03-15 t3fs07 ILOM reports disk failure but fmadm is unaware of it - ILOM bug?
The e-mail message from the ILOM and the ILOM logs contain the following:
ID = 1d7 : 03/15/2011 : 15:47:00 : Drive Slot : DBP/HDD8/STATE : Drive Fault
Checking fmadm and the ZFS status shows that the OS and the filesystem are unaware of any fault. The chassis also has no yellow service LED lit.
root@t3fs07 $ fmadm faulty
root@t3fs07 $ zpool status -x
all pools are healthy
The hd tool shows that HDD8 maps to c2t0. There do not seem to be extraordinarily many SMART failures on that disk:
root@t3fs07 $ hd -R
0 c1t0 77039094 0 22 0 8731708701 9762 0 22 0 0 0 0 555286555 27 0 20 77039094 0 0 0
1 c1t1 194958566 0 22 44 141875462526 9763 0 22 0 0 0 0 589037598 30 0 20 194958566 0 0 0
2 c1t2 91496381 0 22 0 17325275533 9762 0 22 0 0 0 0 589037598 30 0 19 91496381 0 0 0
3 c1t3 182739422 0 91 0 78586645 12492 0 89 0 0 1 0 638976033 33 0 17 182739422 0 0 0
4 c1t4 159143012 0 22 0 348032004841 9762 0 22 0 0 0 0 521666585 25 0 19 159143012 0 0 0
5 c1t5 73493979 0 23 14 8734333959 9762 0 24 0 8 0 0 589037597 29 0 20 73493979 0 0 0
6 c1t6 27164439 0 22 0 25912081303 9763 0 22 0 0 0 0 605880350 30 0 20 27164439 0 0 0
7 c1t7 63012202 0 22 0 8734480074 9762 0 22 0 0 0 0 622723103 31 0 20 63012202 0 0 0
8 c2t0 234259907 0 38 0 46188498 9044 0 38 0 0 0 0 538247194 26 0 11 234259907 0 0 0
9 c2t1 16728995 0 21 1 13023489736 9782 0 21 0 0 0 0 589037597 29 0 21 16728995 0 0 0
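A direct SMART health query can double-check the ILOM claim; smartctl -H prints only the overall self-assessment (a sketch using the same smartctl as elsewhere on this page):
root@t3fs07 $ smartctl -H /dev/rdsk/c2t0d0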
With high probability there is some malfunction in the ILOM. I am resetting it now.
2011-04-21 t3fs11 Nagios reports disk failure but we didn't get the ILOM e-mail or the SMARTd e-mail
Nagios
Nagios was the first to notice the problem (thanks to its active checks):
Notification Type: PROBLEM
Service: check_zfs_data1
Host: t3fs11
Address: 192.33.123.51
State: WARNING
Date/Time: 04-21-2011 02:42:08
Additional Info:
WARNING ZPOOL data1 : DEGRADED {Size:40.6T Used:26.5T Avail:14.1T Cap:65%} raidz2:DEGRADED (c4t4d0:FAULTED)
fmadm
This is what fmadm reports:
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Apr 21 02:30:30 2e66bdd5-5534-c779-a89e-ee6fba716380 ZFS-8000-FD Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.fs.zfs.vdev.io
Affects : zfs://pool=data1/vdev=c8e86b7110798cbe
faulted and taken out of service
Problem in : zfs://pool=data1/vdev=c8e86b7110798cbe
faulted and taken out of service
Description : The number of I/O errors associated with a ZFS device exceeded
acceptable levels. Refer to http://sun.com/msg/ZFS-8000-FD
for more information.
Response : The device has been offlined and marked as faulted. An attempt
will be made to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Run 'zpool status -x' and replace the bad device.
SMARTd
SMARTd was aware of the problem, but no e-mail was sent; after killing and restarting SMARTd I got 2 e-mails:
Device: /dev/rdsk/c4t4d0, 353 Currently unreadable (pending) sectors
Device: /dev/rdsk/c4t4d0, 353 Offline uncorrectable sectors
About these missed SMARTd e-mails: I guess it was my fault, because I changed a disk in t3fs11 some days ago and didn't kill/restart SMARTd with the command:
nohup /opt/csw/sbin/smartd -q never -d &
Next time I change a disk I will restart SMARTd as well.
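A minimal kill/restart sequence (a sketch, assuming smartd runs as the manually started daemon above and not under SMF):
root@t3fs11 $ pkill smartd
root@t3fs11 $ nohup /opt/csw/sbin/smartd -q never -d &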
ILOM
We got the ILOM e-mail only several hours later:
ID = 412 : 04/21/2011 : 11:57:49 : Drive Slot : DBP/HDD24/STATE : Drive Fault
2011-08-23 t3fs11 disk failure, SCSI timeouts, no automatic ZFS failover
SP log:
1994 Tue Aug 23 22:19:05 2011 IPMI Log critical
ID = 419 : 08/23/2011 : 22:19:05 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1993 Tue Aug 23 22:18:22 2011 IPMI Log critical
ID = 418 : 08/23/2011 : 22:18:22 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1985 Fri Aug 19 16:31:25 2011 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
1984 Fri Aug 19 15:50:09 2011 Audit Log minor
root : Open Session : object = /session/type : value = shell : success
1983 Mon Aug 15 20:11:29 2011 Email Connection major
Alert rule 1: Failed to open smtp connection
1982 Mon Aug 15 20:08:19 2011 IPMI Log critical
ID = 415 : 08/15/2011 : 20:08:19 : Drive Slot : DBP/HDD9/STATE : Drive F
ault
1981 Thu Apr 21 17:09:51 2011 IPMI Log critical
ID = 414 : 04/21/2011 : 17:09:51 : Drive Slot : DBP/HDD28/STATE : Hot Sp
are
1980 Thu Apr 21 16:48:44 2011 IPMI Log critical
ID = 413 : 04/21/2011 : 16:48:44 : Drive Slot : DBP/HDD28/STATE : Hot Sp
are
1979 Thu Apr 21 11:57:49 2011 IPMI Log critical
ID = 412 : 04/21/2011 : 11:57:49 : Drive Slot : DBP/HDD24/STATE : Drive
Fault
1978 Wed Apr 20 12:44:24 2011 Audit Log minor
root : Close Session : object = /session/type : value = shell : success
# fmadm takes a long time (minutes):
root@t3fs11 $ fmadm faulty
Aug 24 10:04:43 t3fs11 scsi: WARNING: /pci@0,0/pci10de,375@b/pci1000,1000@0 (mpt1):
Aug 24 10:04:43 t3fs11 Disconnected command timeout for Target 1
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 23 22:18:52 febe5ca3-9978-c5bc-c551-e0d74e165743 DISK-8000-0X Major
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : fault.io.disk.predictive-failure
Affects : dev:///:devid=id1,sd@n5000c50019d0c890//pci@0,0/pci10de,375@b/pci1000,1000@0/sd@1,0
faulted but still in service
FRU : "HD_ID_9" (hc://:product-id=Sun-Fire-X4540:chassis-id=0947AMR033:server-id=t3fs11:serial=9QJ5RCC2:part=ATA-SEAGATE-ST31000N:revision=SU0E/bay=9/disk=0)
faulty
Description : SMART health-monitoring firmware reported that a disk
failure is imminent.
Refer to http://sun.com/msg/DISK-8000-0X for more information.
Response : None.
Impact : It is likely that the continued operation of
this disk will result in data loss.
Action : Schedule a repair procedure to replace the affected disk.
Use fmdump -v -u <EVENT-ID> to identify the disk.
As in earlier cases the system had to be forcefully shut down. I also reset the SP, since we had had problems receiving e-mail messages from it, and I wanted to start from a clean state. As before, the restart of the system involved a longer waiting time in the initialization phase.
-> stop -force /SYS
Are you sure you want to immediately stop /SYS (y/n)? y
Stopping /SYS immediately
-> reset /SP
Are you sure you want to reset /SP (y/n)? y
Performing reset on /SP
I started the system again, and it came up cleanly. ZFS currently seems to be ignorant of the disk problem; it is only detected at the ILOM/fmadm level:
root@t3fs11 $ zpool status -x
all pools are healthy
Reading out the SMART values:
root@t3fs11 $ hd -R
0 c1t0 17249968 0 33 5 373763131589 13633 0 33 0 0 0 0 387252247 23 0 21 17249968 0 0 0
1 c1t1 170699931 0 33 36 12983724431 13632 0 33 0 0 0 0 437780506 26 0 23 170699931 0 0 0
2 c1t2 243231875 0 33 1 107479759363 13633 0 33 0 0 0 0 471466012 28 0 22 243231875 0 0 0
3 c1t3 131051304 0 32 2 100680997 13652 0 32 0 0 0 0 521994271 31 0 23 131051304 0 0 0
4 c1t4 195635912 0 33 3 102820889 13633 0 33 0 0 0 0 387252247 23 0 21 195635912 0 0 0
5 c1t5 90580622 0 33 0 932108957468 13633 0 33 0 1 0 0 420937753 25 0 21 90580622 0 0 0
6 c1t6 186048626 0 33 0 34457395216 13633 0 33 0 0 0 0 437780506 26 0 21 186048626 0 0 0
7 c1t7 181648238 0 32 0 8698160007 13651 0 32 0 0 0 0 471531548 28 0 22 181648238 0 0 0
8 c2t0 137122786 0 34 3 206262087601 13632 0 34 0 0 0 0 387186711 23 0 20 137122786 0 0 0
9 c2t1 76656715 0 34 2045 17284731974 13632 0 34 0 172 4295032833 0 404160536 24 0 22 76656715 2 2 0
10 c2t2 225594443 0 34 21 102963032 13632 0 34 0 0 0 0 437780506 26 0 21 225594443 0 0 0
11 c2t3 144645751 0 33 0 4396831575 13652 0 33 0 0 0 0 454688795 27 0 21 144645751 0 0 0
12 c2t4 161275767 0 34 1 1374496400531 13632 0 34 0 0 0 0 387186711 23 0 20 161275767 0 0 0
13 c2t5 165470843 0 34 1 8692164285 13632 0 34 0 0 0 0 404095000 24 0 21 165470843 0 0 0
14 c2t6 5576874 0 32 0 47547043 6202 0 32 0 0 0 0 421003289 25 0 17 5576874 0 0 0
15 c2t7 179049698 0 34 1 8695524291 13632 0 34 0 0 0 0 471531548 28 0 22 179049698 0 0 0
16 c3t0 181158015 0 33 6 12985284255 13632 0 33 0 0 0 0 370409494 22 0 20 181158015 0 0 0
17 c3t1 131064129 0 33 3 90302906083 13632 0 33 0 0 0 0 420937753 25 0 23 131064129 0 0 0
18 c3t2 18212729 0 34 8 17283364757 13632 0 34 0 0 0 0 454623259 27 0 23 18212729 0 0 0
19 c3t3 238229364 0 21 0 3256974 66 0 21 0 0 0 0 454688795 27 0 20 238229364 0 0 0
20 c3t4 179090552 0 34 11 38757158962 13632 0 34 0 0 0 0 387252247 23 0 21 179090552 0 0 0
21 c3t5 139475973 0 34 0 43049644397 13632 0 34 0 0 0 0 387317783 23 0 21 139475973 0 0 0
22 c3t6 21702988 0 34 9 1052372408231 13632 0 34 0 0 0 0 421003289 25 0 21 21702988 0 0 0
23 c3t7 200787145 0 33 2 17285706890 13652 0 33 0 2 0 0 437846042 26 0 21 200787145 0 0 0
24 c4t0 219100722 0 26 0 4327199423 3091 0 26 0 0 0 0 370409494 22 0 16 219100722 0 0 0
25 c4t1 171714245 0 34 1 124657330319 13632 0 34 0 0 0 0 404095000 24 0 21 171714245 0 0 0
26 c4t2 210424931 0 34 13 12984965035 13632 0 34 0 0 0 0 421003289 26 0 22 210424931 0 0 0
27 c4t3 200584910 0 11 0 57403047 6582 0 11 0 0 0 0 471466012 28 0 19 200584910 0 0 0
28 c4t4 26488941 0 25 0 29022271 2998 0 25 0 0 0 0 387252247 23 0 18 26488941 0 0 0
29 c4t5 47445573 0 34 0 184788229491 13632 0 34 0 0 0 0 404160536 24 0 23 47445573 0 0 0
30 c4t6 12512631 0 34 0 4398114222 13632 0 34 0 0 0 0 404095000 23 0 20 12512631 0 0 0
31 c4t7 121268631 0 34 0 1408851383524 13632 0 34 0 0 0 0 454688795 27 0 23 121268631 0 0 0
32 c5t0 80681234 0 34 1 738842372009 13632 0 34 0 0 0 0 387252247 23 0 21 80681234 0 0 0
33 c5t1 193436627 0 8 1 58130853 6713 0 8 0 0 0 0 404160536 24 0 23 193436627 0 0 0
34 c5t2 57580333 0 34 0 104257729 13632 0 34 0 0 0 0 420937753 25 0 21 57580333 0 0 0
35 c5t3 2134880 0 34 80 12990605345 13632 0 34 0 0 0 0 437846042 26 0 22 2134880 0 0 0
36 c5t4 82431556 0 34 0 12984848023 13632 0 34 0 0 0 0 370409494 22 0 20 82431556 0 0 0
37 c5t5 210183630 0 34 0 4402858349 13632 0 34 0 0 0 0 437780506 26 0 24 210183630 0 0 0
38 c5t6 195477990 0 33 0 103055322 13678 0 33 0 0 0 0 420937753 25 0 21 195477990 0 0 0
39 c5t7 43367020 0 34 4 25876628588 13632 0 34 0 0 0 0 437846042 26 0 22 43367020 0 0 0
40 c6t0 78208807 0 40 0 8684273605 11990 0 39 0 0 0 0 387252247 23 0 14 78208807 0 0 0
41 c6t1 229417618 0 34 0 4395271366 13632 0 34 0 0 0 0 404095000 24 0 22 229417618 0 0 0
42 c6t2 66862671 0 34 5 12993000752 13632 0 34 0 0 0 0 421003289 25 0 23 66862671 0 0 0
43 c6t3 107140408 0 33 0 4400705110 13651 0 33 0 0 0 0 454688795 27 0 23 107140408 0 0 0
44 c6t4 53175997 0 34 6 322226459394 13632 0 34 0 0 4295032833 0 370409494 22 0 20 53175997 0 0 0
45 c6t5 125733869 0 33 0 32685491 13684 0 33 0 0 0 0 404160536 24 0 23 125733869 0 0 0
46 c6t6 240801455 0 34 5 30110127601 13632 0 34 0 0 0 0 421003289 25 0 22 240801455 0 0 0
47 c6t7 176375110 0 32 0 4351157937 13677 0 33 0 0 0 0 437846042 26 0 21 176375110 0 0 0
Mapping:
c2t1d0 | c2::dsk/c2t1d0 | /pci@0,0/pci10de,375@b/pci1000,1000@0/sd@1,0
Offlining the disk and initiating the resilver:
root@t3fs11 $ zpool offline data1 c2t1d0
root@t3fs11 $ cfgadm -c unconfigure c2::dsk/c2t1d0
root@t3fs11 $ zpool replace data1 c2t1d0 c6t6d0
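For the record, the per-device error counters can confirm that this disk is the one misbehaving (a sketch; iostat -En is standard Solaris and prints soft/hard/transport error counts, run before the unconfigure while the device node still exists):
root@t3fs11 $ iostat -En c2t1d0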
List of occurrences in 2013:
2013-08-05 t3fs07 OS/disks crash
The server was almost totally frozen:
Aug 5 16:53:22 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:53:22 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:54:33 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:54:33 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:55:44 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:55:44 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:55:45 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd33):
Aug 5 16:55:45 t3fs07 SCSI transport failed: reason 'reset': giving up
Aug 5 16:56:55 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:56:55 t3fs07 Disconnected command timeout for Target 1
Aug 5 16:58:06 t3fs07 scsi: WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 5 16:58:06 t3fs07 Disconnected command timeout for Target 1
REBOOT
LSI Corporation MPT SAS BIOS
MPTBIOS-6.26.00.00 (2008.10.14) <-----
Copyright 2000-2008 LSI Corporation.
Adapter configuration may have changed, reconfiguration is suggested!
Searching for devices at HBA 0...
Searching for devices at HBA 1...
0de,376@f/pci1000,1000@0 (mpt5):
SLOT ID LUN VENDOR PRODUCT REVISION SIZE \ NV
---- --- --- -------- ---------------- ---------- ---------
0 0 0 ATA SEAGATE ST31000N SU0E 931 GB 000,1000@0 (mpt5):
0 1 0 ATA SEAGATE ST31000N SU0E 931 GB
0 LSILogic SAS1068E-IT 1.27.02.00 NV 2D:03 <-----
0 0 0 ATA SEAGATE ST31000N SU12 931 GB
0 1 0 ATA SEAGATE ST31000N SU0E 931 GB
0 LSILogic SAS1068E-IT 1.27.02.00 NV 2D:03 <-----
GRUB ERROR
'/platform/i86pc/multiboot -B zfs-bootfs=rpool/61,bootpath="/pci@0,0/pci-ide@4/ o@0
ide@0/cmdk@0,0:a",diskdevid="id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____0000137F/a"'
is loaded
module /platform/i86pc/boot_archive
zio_read_data failed
Error 16: Inconsistent filesystem structure
Booting 'Solaris 10 10/09 s10x_u8wos_08a X86'
findroot (pool_rpool,0,a)
Filesystem type is zfs, partition type 0xbf 0de,376@f/pci1000,1000@0 (mpt5):
kernel$ /platform/i86pc/multiboot -B $ZFS-BOOTFS
loading '/platform/i86pc/multiboot -B $ZFS-BOOTFS' ...
[Multiboot-elf, <0x1000000:0x1442b:0x12901>, shtab=0x1027258, entry=0x100000 000,1000@0 (mpt5):
0]
'/platform/i86pc/multiboot -B zfs-bootfs=rpool/61,bootpath="/pci@0,0/pci-ide@4/
ide@0/cmdk@0,0:a",diskdevid="id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____0000137F/a"'
is loaded
module /platform/i86pc/boot_archive
checksum verification failed
Error 16: Inconsistent filesystem structure
Press any key to continue...
SOLARIS FAILSAFE ATTEMPT
SunOS Release 5.10 Version Generic_141445-09 64-bit
Copyright 1983-2009 Sun Microsystems, Inc. All rights reserved.
Use is subject to license terms.
Booting to milestone "milestone/single-user:default".
Configuring devices.
WARNING: /pci@0,0/pci-ide@4/ide@0 unable to enable write cache targ=0
Searching for installed OS instances...
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd41):
SCSI transport failed: reason 'reset': retrying command
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0/sd@1,0 (sd41):
SCSI transport failed: reason 'reset': giving up
WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Disconnected command timeout for Target 1
AFTER THE NEW SOLARIS 11 INSTALLATION
# zfs snapshot rpool/ROOT/solaris@06-08-2013
# wget http://mirror.opencsw.org/opencsw/pkgutil.pkg
# pkgadd -d pkgutil.pkg
# /opt/csw/bin/pkgutil -i lsof -y
# /opt/csw/bin/pkgutil -i xpdf ggrep vim wgetpaste watch top sudo python pstree nrpe_plugin nano -y
# /opt/csw/bin/pkgutil -i nagios_plugins -y
# /opt/csw/bin/pkgutil -i nrpe -y
# /opt/csw/bin/pkgutil -i netsnmp -y
# /opt/csw/bin/pkgutil -y -i bash gawk emacs
# /opt/csw/bin/pkgutil -y -i smartmontools && svcadm disable cswsmartd
# /opt/csw/bin/pkgutil -y -i gsed rsync
# /opt/csw/bin/pkgutil -i -y CSWpm-libwww-perl
# svcadm disable cswrsyncd
# /opt/csw/bin/pkgutil -i netcat -y
root@t3fs07:/export/home/jack# pkgadd -d ./SUNWhd-1.07.pkg
root@t3fs07:/export/home/jack# hd -c -s
platform = Sun Fire X4540
Device Serial Vendor Model Rev Temperature
------ ------ ------ ----- ---- -----------
c10t0d0p0 9QJ5KV96 ATA SEAGATE ST31000N SU12 22 C (71 F)
c10t1d0p0 9QJ5QHCN ATA SEAGATE ST31000N SU0E 26 C (78 F)
c10t2d0p0 W9K0HZ0U061L ATA Hitachi HUA72201 A3EA 25 C (77 F)
c10t3d0p0 F002PAJUSJ4F ATA HITACHI HUA7210S AC5A 30 C (86 F)
c10t4d0p0 9QJ5RVKJ ATA SEAGATE ST31000N SU0E 23 C (73 F)
c10t5d0p0 9QJ5R7JX ATA SEAGATE ST31000N SU0E 25 C (77 F)
c10t6d0p0 9QJ5R6W5 ATA SEAGATE ST31000N SU12 25 C (77 F)
c10t7d0p0 A060PBK4JJTF ATA HITACHI HUA7210S AC5A 30 C (86 F)
c11t0d0p0 F002PBJTH4KF ATA HITACHI HUA7210S AC5A 23 C (73 F)
c11t1d0p0 9QJ5TMNF ATA SEAGATE ST31000N SU0E 24 C (75 F)
c11t2d0p0 9QJ5V7FP ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t3d0p0 9QJ5QJ1G ATA SEAGATE ST31000N SU0E 27 C (80 F)
c11t4d0p0 9QJ5V7FA ATA SEAGATE ST31000N SU0E 22 C (71 F)
c11t5d0p0 9QJ5QKCV ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t6d0p0 9QJ5LT8H ATA SEAGATE ST31000N SU0E 25 C (77 F)
c11t7d0p0 A060PBK528EF ATA HITACHI HUA7210S AC5A 29 C (84 F)
c12t0d0p0 9QJ5TM9Z ATA SEAGATE ST31000N SU0E 23 C (73 F)
c12t1d0p0 9QJ5RVQL ATA SEAGATE ST31000N SU0E 24 C (75 F)
c12t2d0p0 W9K0HD2XKHTL ATA HITACHI H7210CA3 A3CB 25 C (77 F)
c12t3d0p0 9QJ5RW4M ATA SEAGATE ST31000N SU0E 27 C (80 F)
c12t4d0p0 W9H0N01D14MV ATA Hitachi HUA72201 A3EA 22 C (71 F)
c12t5d0p0 9QJ5R9NJ ATA SEAGATE ST31000N SU0E 25 C (77 F)
c12t6d0p0 9QJ7VM1J ATA SEAGATE ST31000N SU0F 25 C (77 F)
c12t7d0p0 9QJ5MWQG ATA SEAGATE ST31000N SU0F 26 C (78 F)
c13t0d0p0 9QJ5V7FS ATA SEAGATE ST31000N SU0E 22 C (71 F)
c13t1d0p0 WMAW31661409 ATA WDC WD1003FBYX-0 1V02 25 C (77 F)
c13t2d0p0 W9K0HZ082KVL ATA Hitachi HUA72201 A3EA 25 C (77 F)
c13t3d0p0 A060PBK4ZS0F ATA HITACHI HUA7210S AC5A 29 C (84 F)
c13t4d0p0 9QJ5QY5N ATA SEAGATE ST31000N SU0E 23 C (73 F)
c13t5d0p0 9QJ5RV8M ATA SEAGATE ST31000N SU0E 25 C (77 F)
c13t6d0p0 9QJ5R6QV ATA SEAGATE ST31000N SU0E 26 C (78 F)
c13t7d0p0 9QJ5NHR3 ATA SEAGATE ST31000N SU0E 27 C (80 F)
c14t0d0p0 9QJ5TMJE ATA SEAGATE ST31000N SU0E 23 C (73 F)
c14t1d0p0 9QJ5RRV8 ATA SEAGATE ST31000N SU0E 25 C (77 F)
c14t2d0p0 9QJ5P70T ATA SEAGATE ST31000N SU0E 25 C (77 F)
c14t3d0p0 9QJ4YZAX ATA SEAGATE ST31000N SU0F 28 C (82 F)
c14t4d0p0 A060PBK56GJF ATA HITACHI HUA7210S AC5A 24 C (75 F)
c14t5d0p0 9QJ5S4JF ATA SEAGATE ST31000N SU0E 26 C (78 F)
c14t6d0p0 9QJ5TM8X ATA SEAGATE ST31000N SU0E 27 C (80 F)
c14t7d0p0 9QJ5QQAF ATA SEAGATE ST31000N SU0E 28 C (82 F)
c8d0p0 00014E4 - UGB30SDC16H0P4 - None
c9t0d0p0 9QJ5TMAY ATA SEAGATE ST31000N SU0E 23 C (73 F)
c9t1d0p0 9QJ5RR2P ATA SEAGATE ST31000N SU0E 26 C (78 F)
c9t2d0p0 W9K0HD2XV2DL ATA HITACHI H7210CA3 A3CB 26 C (78 F)
c9t3d0p0 9QJ3C4MZ ATA SEAGATE ST31000N SU0F 28 C (82 F)
c9t4d0p0 9QJ5QMCF ATA SEAGATE ST31000N SU0E 22 C (71 F)
c9t5d0p0 9QJ5RV9K ATA SEAGATE ST31000N SU0E 25 C (77 F)
c9t6d0p0 W9K0N015K0GL ATA Hitachi HUA72201 A3EA 25 C (77 F)
c9t7d0p0 9QJ5TN9J ATA SEAGATE ST31000N SU0E 27 C (80 F)
-----------------------------SunFire X4540-------Rear-----------------
3: 7: 11: 15: 19: 23: 27: 31: 35: 39: 43: 47:
c9t3 c9t7 c10t3 c10t7 c11t3 c11t7 c12t3 c12t7 c13t3 c13t7 c14t3 c14t7
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
2: 6: 10: 14: 18: 22: 26: 30: 34: 38: 42: 46:
c9t2 c9t6 c10t2 c10t6 c11t2 c11t6 c12t2 c12t6 c13t2 c13t6 c14t2 c14t6
^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
1: 5: 9: 13: 17: 21: 25: 29: 33: 37: 41: 45:
c9t1 c9t5 c10t1 c10t5 c11t1 c11t5 c12t1 c12t5 c13t1 c13t5 c14t1 c14t5
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
0: 4: 8: 12: 16: 20: 24: 28: 32: 36: 40: 44:
c9t0 c9t4 c10t0 c10t4 c11t0 c11t4 c12t0 c12t4 c13t0 c13t4 c14t0 c14t4
^b+ ^++ ^b+ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++ ^++
-------*---------*-----------SunFire X4540---*---Front-----*-------*---
Summary:
Vendor Model Count
------ ----- -----
ATA SEAGATE ST31000N 35
ATA Hitachi HUA72201 4
ATA HITACHI HUA7210S 6
ATA HITACHI H7210CA3 2
ATA WDC WD1003FBYX-0 1
Total Storage Devices = 48
WHICH ARE THE BROKEN DISKS?
HD_26 c12t2
HD_41 c14t1 <-- this one is really driving Solaris crazy; it reacts with tens of:
Aug 6 22:44:18 t3fs07 scsi: [ID 107833 kern.warning] WARNING: /pci@3c,0/pci10de,376@f/pci1000,1000@0 (mpt5):
Aug 6 22:44:18 t3fs07 Disconnected command timeout for Target 1
HD_47 c14t7
IMPORTING ZFS /data1 INTO SOLARIS 11
Aug 7 10:50:19 t3fs07 zfs: [ID 249136 kern.info] imported version 15 pool data1 using 34
Aug 7 10:53:13 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 7 10:53:13 t3fs07 EVENT-TIME: Wed Aug 7 10:53:12 CEST 2013
Aug 7 10:53:13 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 7 10:53:13 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 7 10:53:13 t3fs07 EVENT-ID: 13ded51b-674e-c92b-f852-9a456cc01793
Aug 7 10:53:13 t3fs07 DESC: ZFS device 'id1,sd@n5000c50019c4f1c2/a' in pool 'data1' failed to open.
Aug 7 10:53:13 t3fs07 AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
Aug 7 10:53:13 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 7 10:53:13 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
Aug 7 10:53:13 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-LR, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 7 10:53:13 t3fs07 EVENT-TIME: Wed Aug 7 10:53:13 CEST 2013
Aug 7 10:53:13 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 7 10:53:13 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 7 10:53:13 t3fs07 EVENT-ID: a70d0f7d-f4ff-e396-d9da-cb8d5caa4841
Aug 7 10:53:13 t3fs07 DESC: ZFS device 'id1,sd@n5000c50019b40654/a' in pool 'data1' failed to open.
Aug 7 10:53:13 t3fs07 AUTO-RESPONSE: An attempt will be made to activate a hot spare if available.
Aug 7 10:53:13 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 7 10:53:13 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-LR for the latest service procedures and policies regarding this diagnosis.
AFTER ZFS COMPLETED THE RESILVERING
-bash-4.1# zpool status
pool: data1
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
Run 'zpool status -v' to see device specific details.
see: http://support.oracle.com/msg/ZFS-8000-8A
scan: resilvered 1.44T in 11h53m with 1 errors on Wed Aug 7 22:47:36 2013
config:
NAME STATE READ WRITE CKSUM
data1 DEGRADED 0 0 1
raidz2-0 ONLINE 0 0 0
c9t0d0 ONLINE 0 0 0
c9t5d0 ONLINE 0 0 0
c10t2d0 ONLINE 0 0 0
c10t7d0 ONLINE 0 0 0
c11t4d0 ONLINE 0 0 0
c12t1d0 ONLINE 0 0 0
c12t6d0 ONLINE 0 0 0
c13t3d0 ONLINE 0 0 0
c14t0d0 ONLINE 0 0 0
raidz2-1 DEGRADED 0 0 121
c9t1d0 DEGRADED 0 0 121
c9t6d0 DEGRADED 0 0 121
c10t3d0 DEGRADED 0 0 121
c11t0d0 DEGRADED 0 0 121
c11t5d0 DEGRADED 0 0 121
spare-5 DEGRADED 0 0 0
15623725476041760867 UNAVAIL 0 0 0
c14t6d0 DEGRADED 0 0 0
c12t7d0 DEGRADED 1 0 0
c13t4d0 DEGRADED 0 0 121
spare-8 DEGRADED 0 0 0
1583280912036438145 UNAVAIL 0 0 0
c14t7d0 DEGRADED 0 0 0
raidz2-2 ONLINE 0 0 0
c9t2d0 ONLINE 0 0 0
c9t7d0 ONLINE 0 0 0
c10t4d0 ONLINE 0 0 0
c11t1d0 ONLINE 0 0 0
c11t6d0 ONLINE 0 0 0
c12t3d0 ONLINE 0 0 0
c13t0d0 ONLINE 0 0 0
c13t5d0 ONLINE 0 0 0
c14t2d0 ONLINE 0 0 0
raidz2-3 ONLINE 0 0 0
c9t3d0 ONLINE 0 0 0
c10t0d0 ONLINE 0 0 0
c10t5d0 ONLINE 0 0 0
c11t2d0 ONLINE 0 0 0
c11t7d0 ONLINE 0 0 0
c12t4d0 ONLINE 0 0 0
c13t1d0 ONLINE 0 0 0
c13t6d0 ONLINE 0 0 0
c14t3d0 ONLINE 0 0 0
raidz2-4 ONLINE 0 0 0
c9t4d0 ONLINE 0 0 0
c10t1d0 ONLINE 0 0 0
c10t6d0 ONLINE 0 0 0
c11t3d0 ONLINE 0 0 0
c12t0d0 ONLINE 0 0 0
c12t5d0 ONLINE 0 0 0
c13t2d0 ONLINE 0 0 0
c13t7d0 ONLINE 0 0 0
c14t4d0 ONLINE 0 0 0
spares
c14t7d0 INUSE
c14t6d0 INUSE
c14t5d0 AVAIL
errors: 1 data errors, use '-v' for a list
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c8d0 ONLINE 0 0 0
errors: No known data errors
DETACHING THE BROKEN DISKS
-bash-4.1# zpool detach data1 15623725476041760867
-bash-4.1# zpool detach data1 1583280912036438145
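After detaching the two stale vdev GUIDs, the activated spares become permanent pool members; a quick check that the pool now reports them in place (sketch):
-bash-4.1# zpool status data1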
ALLOWING root LOGIN by SSH
http://veereshkumarn.blogspot.ch/2012/09/how-to-enable-ssh-root-login-in-solaris.html
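In short, that page boils down to the following steps (a sketch for a stock Solaris 11 install; verify against the link):
-bash-4.1# rolemod -K type=normal root     # root is only a role by default on Solaris 11
-bash-4.1# vi /etc/ssh/sshd_config         # set PermitRootLogin yes
-bash-4.1# svcadm restart svc:/network/ssh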
ALSO THE NEW FLASH CARD IS BROKEN!
Aug 9 10:35:44 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-8A, TYPE: Fault, VER: 1, SEVERITY: Critical
Aug 9 10:35:44 t3fs07 EVENT-TIME: Fri Aug 9 10:35:44 CEST 2013
Aug 9 10:35:44 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 9 10:35:44 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 9 10:35:44 t3fs07 EVENT-ID: 94b57bca-b249-cac8-90dd-e65056c254ae
Aug 9 10:35:44 t3fs07 DESC: A file or directory in pool 'rpool' could not be read due to corrupt data.
Aug 9 10:35:44 t3fs07 AUTO-RESPONSE: No automated response will occur.
Aug 9 10:35:44 t3fs07 IMPACT: The file or directory is unavailable.
Aug 9 10:35:44 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -xv' and examine the list of damaged files to determine what has been affected. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-8A for the latest service procedures and policies regarding this diagnosis.
Aug 9 10:35:47 t3fs07 fmd: [ID 377184 daemon.error] SUNW-MSG-ID: ZFS-8000-GH, TYPE: Fault, VER: 1, SEVERITY: Major
Aug 9 10:35:47 t3fs07 EVENT-TIME: Fri Aug 9 10:35:47 CEST 2013
Aug 9 10:35:47 t3fs07 PLATFORM: Sun-Fire-X4540, CSN: 0949AMR020, HOSTNAME: t3fs07
Aug 9 10:35:47 t3fs07 SOURCE: zfs-diagnosis, REV: 1.0
Aug 9 10:35:47 t3fs07 EVENT-ID: fe9c1a29-d87c-6bf2-cd5a-f2ca0042d38b
Aug 9 10:35:47 t3fs07 DESC: The number of checksum errors associated with ZFS device 'id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b' in pool 'rpool' exceeded acceptable levels.
Aug 9 10:35:47 t3fs07 AUTO-RESPONSE: The device has been marked as degraded. An attempt will be made to activate a hot spare if available.
Aug 9 10:35:47 t3fs07 IMPACT: Fault tolerance of the pool may be compromised.
Aug 9 10:35:47 t3fs07 REC-ACTION: Use 'fmadm faulty' to provide a more detailed view of this event. Run 'zpool status -lx' for more information. Please refer to the associated reference document at http://support.oracle.com/msg/ZFS-8000-GH for the latest service procedures and policies regarding this diagnosis.
-bash-4.1# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 09 10:35:44 94b57bca-b249-cac8-90dd-e65056c254ae ZFS-8000-8A Critical
Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
System Component
Manufacturer : Sun-Microsystems
Name : Sun-Fire-X4540
Part_Number : 602-4887-01
Serial_Number : 0949AMR020
Host_ID : 00c18d96
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.object.corrupt_data
Certainty : 100%
Affects : zfs://pool=916b26b45c63015a/pool_name=rpool
Status : faulted but still providing degraded service
FRU
Name : "zfs://pool=916b26b45c63015a/pool_name=rpool"
Status : faulty
Description : A file or directory in pool 'rpool' could not be read due to
corrupt data.
Response : No automated response will occur.
Impact : The file or directory is unavailable.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -xv' and examine the list of damaged files to
determine what has been affected. Please refer to the associated
reference document at http://support.oracle.com/msg/ZFS-8000-8A
for the latest service procedures and policies regarding this
diagnosis.
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 09 10:35:47 fe9c1a29-d87c-6bf2-cd5a-f2ca0042d38b ZFS-8000-GH Major
Problem Status : solved
Diag Engine : zfs-diagnosis / 1.0
System
Manufacturer : unknown
Name : unknown
Part_Number : unknown
Serial_Number : unknown
System Component
Manufacturer : Sun-Microsystems
Name : Sun-Fire-X4540
Part_Number : 602-4887-01
Serial_Number : 0949AMR020
Host_ID : 00c18d96
----------------------------------------
Suspect 1 of 1 :
Fault class : fault.fs.zfs.vdev.checksum
Certainty : 100%
Affects : zfs://pool=916b26b45c63015a/vdev=42e34d6f3fa7092a/pool_name=rpool/vdev_name=id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b
Status : faulted but still providing degraded service
FRU
Name : "zfs://pool=916b26b45c63015a/vdev=42e34d6f3fa7092a/pool_name=rpool/vdev_name=id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b"
Status : faulty
Description : The number of checksum errors associated with ZFS device
'id1,cmdk@AUGB30SDC16H0P4=SDC16H0_____000014E4/b' in pool 'rpool'
exceeded acceptable levels.
Response : The device has been marked as degraded. An attempt will be made
to activate a hot spare if available.
Impact : Fault tolerance of the pool may be compromised.
Action : Use 'fmadm faulty' to provide a more detailed view of this event.
Run 'zpool status -lx' for more information. Please refer to the
associated reference document at
http://support.oracle.com/msg/ZFS-8000-GH for the latest service
procedures and policies regarding this diagnosis.
REBOOT => OS corrupted
svc.configd: smf(5) database integrity check of:
/etc/svc/repository.db
failed. The database might be damaged or a media error might have
prevented it from being verified. Additional information useful to
your service provider is in:
/system/volatile/db_errors
The system will not be able to boot until you have restored a working
database. svc.startd(1M) will provide a sulogin(1M) prompt for recovery
purposes. The command:
/lib/svc/bin/restore_repository
can be run to restore a backup version of your repository. See
http://support.oracle.com/msg/SMF-8000-MY for more information.
Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
svc.configd exited with status 102 (database initialization failure)
Enter user name for system maintenance (control-d to bypass): root
Enter root password (control-d to bypass):
single-user privilege assigned to root on /dev/console.
Entering System Maintenance Mode
Aug 9 11:46:35 su: pam_unix_cred: error creating /var/user/root: No such file or directory
Aug 9 11:46:35 su: pam_unix_cred: chown error on /var/user/root: No such file or directory
Aug 9 11:46:35 su: 'su root' succeeded for root on /dev/console
Oracle Corporation SunOS 5.11 11.1 September 2012
-bash-4.1#
-bash-4.1# /lib/svc/bin/restore_repository
See http://support.oracle.com/msg/SMF-8000-MY for more information on the use of
this script to restore backup copies of the smf(5) repository.
If there are any problems which need human intervention, this script will
give instructions and then exit back to your shell.
/lib/svc/bin/restore_repository[71]: [: /: arithmetic syntax error
The following backups of /etc/svc/repository.db exist, from
oldest to newest:
manifest_import-20130806_161857
manifest_import-20130806_162908
boot-20130806_185210
boot-20130807_102619
manifest_import-20130807_144541
manifest_import-20130807_150913
boot-20130808_102130
boot-20130809_102928
The backups are named based on their type and the time what they were taken.
Backups beginning with "boot" are made before the first change is made to
the repository after system boot. Backups beginning with "manifest_import"
are made after svc:/system/manifest-import:default finishes its processing.
The time of backup is given in YYYYMMDD_HHMMSS format.
Please enter either a specific backup repository from the above list to
restore it, or one of the following choices:
CHOICE ACTION
---------------- ----------------------------------------------
boot restore the most recent post-boot backup
manifest_import restore the most recent manifest_import backup
-seed- restore the initial starting repository (All
customizations will be lost, including those
made by the install/upgrade process.)
-quit- cancel script and quit
Enter response [boot]:
Unable to open database "/etc/svc/repository-boot": disk I/O error
After confirmation, the following steps will be taken:
svc.startd(1M) and svc.configd(1M) will be quiesced, if running.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20130809_121110
//system/volatile/db_errors
-- copied --> /etc/svc/repository.db_old_20130809_121110_errors
/etc/svc/repository-boot
-- copied --> /etc/svc/repository.db
and the system will be rebooted with reboot(1M).
Proceed [yes/no]? yes
Quiescing svc.startd(1M) and svc.configd(1M): done.
/etc/svc/repository.db
-- renamed --> /etc/svc/repository.db_old_20130809_121110
//system/volatile/db_errors
-- copied --> /etc/svc/repository.db_old_20130809_121110_errors
/etc/svc/repository-boot
-- copied --> /etc/svc/repository.db
/etc/svc/repository.db.new.22: I/O error
Failed. To start svc.start(1M) running, do: /usr/bin/prun 11
-bash-4.1#
2013-08-07 t3fs11 strange spares behaviour
root@t3fs11 $ zpool status -v
pool: data1
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
scrub: resilver in progress for 2h32m, 38.44% done, 4h3m to go
config:
NAME STATE READ WRITE CKSUM
data1 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t0d0 ONLINE 0 0 0
c1t5d0 ONLINE 0 0 0
c2t2d0 ONLINE 0 0 0
c2t7d0 ONLINE 0 0 0
c3t4d0 ONLINE 0 0 0
c4t1d0 ONLINE 0 0 0
c4t6d0 ONLINE 0 0 0
c5t3d0 ONLINE 0 0 0
c6t0d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t1d0 ONLINE 0 0 0
c1t6d0 ONLINE 0 0 0
c2t3d0 ONLINE 0 0 0
c6t5d0 ONLINE 0 0 0 310G resilvered <-- they were spares
c3t5d0 ONLINE 0 0 0
c4t2d0 ONLINE 0 0 0
c4t7d0 ONLINE 0 0 0
c6t7d0 ONLINE 0 0 0 310G resilvered <-- they were spares
c6t1d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t2d0 ONLINE 0 0 0
c1t7d0 ONLINE 0 0 0
c2t4d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
c3t6d0 ONLINE 0 0 0
c4t3d0 ONLINE 0 0 0
c5t0d0 ONLINE 0 0 0
c5t5d0 ONLINE 0 0 0
c6t2d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t3d0 ONLINE 0 0 0
c2t0d0 ONLINE 0 0 0
c2t5d0 ONLINE 0 0 0
c3t2d0 ONLINE 0 0 0
c3t7d0 ONLINE 0 0 0
c4t4d0 ONLINE 0 0 0
c5t1d0 ONLINE 0 0 0
c5t6d0 ONLINE 0 0 0
c6t3d0 ONLINE 0 0 0
raidz2 ONLINE 0 0 0
c1t4d0 ONLINE 0 0 0
c2t1d0 ONLINE 0 0 0
c2t6d0 ONLINE 0 0 0
c3t3d0 ONLINE 0 0 0
c4t0d0 ONLINE 0 0 0
c4t5d0 ONLINE 0 0 0
c5t2d0 ONLINE 0 0 0
c5t7d0 ONLINE 0 0 0
c6t4d0 ONLINE 0 0 0
spares
c6t6d0 AVAIL
c3t0d0 AVAIL <-- these were pool disks; I swapped both because they were broken
c5t4d0 AVAIL <-- these were pool disks
errors: Permanent errors have been detected in the following files:
/data1/t3fs11_cms_1/data/000002CC181F180E40F593A9313C8EAC5269 <--- I removed all of these files
/data1/t3fs11_cms_1/data/00004F74157CEDC241418C7ECD9A495EDC10
/data1/t3fs11_cms/data/000048FE1CBF54F14D4FA68AB53A5F54F21E
/data1/t3fs11_cms_1/data/0000F54388A555E542EFA8015AB2C86C9F40
/data1/t3fs11_cms_1/data/00002231135E4131437B81CEE1BEFC39978E
/data1/t3fs11_cms/data/0000E748611DBADE4505942C25077055E179
pool: rpool
state: ONLINE
scrub: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c0d0s0 ONLINE 0 0 0
errors: No known data errors
root@t3fs11 $
The zfs-diagnosis FMD module got disabled; that had never happened before!
root@t3fs11 $ fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Aug 05 18:02:26 91b71d11-0d46-62ed-9e8b-b6df1c5ff285 FMD-8000-2K Minor
Host : t3fs11
Platform : Sun Fire X4540 Chassis_id : 0947AMR033
Fault class : defect.sunos.fmd.module
Affects : fmd:///module/zfs-diagnosis
faulted but still in service
Description : A Solaris Fault Manager component has experienced an error that
required the module to be disabled. Refer to
http://sun.com/msg/FMD-8000-2K for more information.
Response : The module has been disabled. Events destined for the module
will be saved for manual diagnosis.
Impact : Automated diagnosis and response for subsequent events associated
with this module will not occur.
Action : Use fmdump -v -u <EVENT-ID> to locate the module. Use fmadm
reset <module> to reset the module.
root@t3fs11 $ fmadm reset zfs-diagnosis
fmadm: zfs-diagnosis module has been reset
root@t3fs11 $ fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-retire 1.1 active CPU/Memory Retire Agent
disk-monitor 1.0 active Disk Monitor
disk-transport 1.0 active Disk Transport Agent
eft 1.16 active eft diagnosis engine
fabric-xlate 1.0 active Fabric Ereport Translater
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sp-monitor 1.0 active Service Processor Monitor
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
zfs-retire 1.0 active ZFS Retire Agent