LustreDiskChangeProcedures

If you find that a disk has died on an OSS, here is the procedure:

First, find out which disk has failed:

[root@oss31 ~]# grep -C2 _ /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md17 : active raid6 sdcf[0] sdco[9] sdcn[8] sdcm[7] sdcl[6] sdck[5] sdcj[10](F) sdci[3] sdch[2] sdcg[1]
      3907091456 blocks level 6, 128k chunk, algorithm 2 [10/9] [UUUU_UUUUU]
                in: 24771488 reads, 38246047 writes; out: 3561687537 reads, 1769535592 writes
                2236207063 in raid5d, 131583 out of stripes, 3545221835 handle called

[root@oss31 ~]# fdisk -l /dev/sdcj
[root@oss31 ~]#

Here we see that sdcj is not working: it is marked (F) in /proc/mdstat, and fdisk prints nothing because the system no longer knows anything about the device.
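
The check above can be scripted. The following is a minimal sketch, not part of the original procedure: the function name failed_members is hypothetical, and it assumes the mdstat layout shown above, where a failed member carries an (F) suffix.

```shell
# List failed RAID members as "<array> <device>" pairs.
# Reads mdstat-format text from the file given as $1 (default: /proc/mdstat).
failed_members() {
    awk '
        # Array lines look like: "md17 : active raid6 sdcf[0] ... sdcj[10](F) ..."
        $1 ~ /^md[0-9]+$/ && $2 == ":" {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*$/, "", dev)   # strip "[10](F)" to leave the device name
                    print $1, dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Run without arguments on the OSS, it reads /proc/mdstat; for the md17 line shown above it would print "md17 sdcj".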

(from /var/log/messages)

May 14 13:27:44 oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device
May 14 13:27:44 oss31 last message repeated 2 times
May 14 13:27:44 oss31 kernel: raid5: Disk failure on sdcj, disabling device. Operation continuing on 9 devices
May 14 13:27:44 oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device
May 14 13:27:44 oss31 kernel: raid5:md17: read error not correctable (sector 821658128 on sdcj)

[root@oss31 ~]# mdadm --detail md17
mdadm: cannot open md17: No such file or directory
[root@oss31 ~]# mdadm --detail /dev/md17
/dev/md17:
        Version : 0.90
  Creation Time : Tue Mar  2 18:03:16 2010
     Raid Level : raid6
     Array Size : 3907091456 (3726.09 GiB 4000.86 GB)
  Used Dev Size : 488386432 (465.76 GiB 500.11 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 17
    Persistence : Superblock is persistent

  Intent Bitmap : /lustre/scratch/bmp17/bitmap

    Update Time : Tue May 18 10:19:07 2010
          State : clean, degraded
 Active Devices : 9
Working Devices : 9
 Failed Devices : 1
  Spare Devices : 0

     Chunk Size : 128K

           UUID : 22affd18:9e22b048:7d19f379:8ed3dc17
         Events : 0.419992

    Number   Major   Minor   RaidDevice State
       0      69       48        0      active sync   /dev/sdcf
       1      69       64        1      active sync   /dev/sdcg
       2      69       80        2      active sync   /dev/sdch
       3      69       96        3      active sync   /dev/sdci
       4       0        0         4      removed
       5      69      128        5      active sync   /dev/sdck
       6      69      144        6      active sync   /dev/sdcl
       7      69      160        7      active sync   /dev/sdcm
       8      69      176        8      active sync   /dev/sdcn
       9      69      192        9      active sync   /dev/sdco

      10      69      112        -      faulty spare

Now, to make sure the system has truly disabled the disk, try to mark it as faulty in the array:

[root@oss31 ~]# mdadm -f /dev/md17 /dev/sdcj
mdadm: cannot find /dev/sdcj: No such file or directory

Since the system has disabled the device, mdadm is not aware of it anymore either.

Next, remove the failed device from the array, since the array has not released it yet:

[root@oss31 ~]# mdadm /dev/md17 --remove failed
mdadm: hot removed 69:112

Now we need to make sure that the kernel has truly dropped the disk. This is done by echoing a string into /proc/scsi/scsi. First, however, you need to know the SCSI address of the disk, which appears in the /var/log/messages output above:

oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device

This means our command will look like this:

[root@oss31 ~]# echo "scsi remove-single-device" 2 0 39 0 > /proc/scsi/scsi
-bash: echo: write error: No such device or address

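The write error is expected here: it indicates that the kernel no longer has a device at that address. Now we can be confident that nothing ugly will happen when we pull the disk.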
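
To avoid typing the address by hand, the string can be built from the address in the log line. This is a small sketch under the assumption that the address has the host:channel:id:lun form shown above; the function name scsi_remove_string is hypothetical.

```shell
# Turn a SCSI address "host:channel:id:lun" (as printed in the kernel log)
# into the command string expected by /proc/scsi/scsi.
scsi_remove_string() {
    addr=$1
    case $addr in
        *[!0-9:]*)   echo "bad SCSI address: $addr" >&2; return 1 ;;  # stray characters
        *:*:*:*:*)   echo "bad SCSI address: $addr" >&2; return 1 ;;  # too many fields
        ?*:?*:?*:?*) ;;                                               # four non-empty fields
        *)           echo "bad SCSI address: $addr" >&2; return 1 ;;
    esac
    # Replace the colons with spaces: 2:0:39:0 -> 2 0 39 0
    echo "scsi remove-single-device $(echo "$addr" | tr ':' ' ')"
}
```

The function only prints the string. To apply it, redirect the output yourself, for example: scsi_remove_string 2:0:39:0 > /proc/scsi/scsi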

Now, we physically replace the disk with a new one.

When you get back to your terminal, check dmesg to see which device name the system has given the new disk:

[root@oss31 ~]# dmesg|tail -14
mptsas: ioc2: attaching sata device, channel 0, id 14, phy 14
  Vendor: ATA       Model: HITACHI HUA7250S  Rev: AC4A
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sdct: 976773168 512-byte hdwr sectors (500108 MB)
sdct: Write Protect is off
sdct: Mode Sense: 73 00 00 08
SCSI device sdct: drive cache: write through
SCSI device sdct: 976773168 512-byte hdwr sectors (500108 MB)
sdct: Write Protect is off
sdct: Mode Sense: 73 00 00 08
SCSI device sdct: drive cache: write through
 sdct: unknown partition table
sd 2:0:50:0: Attached scsi disk sdct
sd 2:0:50:0: Attached scsi generic sg93 type 0

So now we know that the new disk is /dev/sdct. We don't reference the disks by their sd names, but by the unique disk IDs provided by the disk manufacturer. Now we need to add the new disk to the original RAID array and update /etc/mdadm.conf.oss and /etc/mdadm.conf.local. First, find out which disk ID in the array's device list has disappeared:

[root@oss31 ~]# for i in /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF \
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF \
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F \
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF \
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF \
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F; do echo $i; ls -l $i; done

/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF -> ../../sdcf
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF -> ../../sdcg
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF -> ../../sdch
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF -> ../../sdci
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F
ls: /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F: No such file or directory
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF -> ../../sdck
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF -> ../../sdcl
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF -> ../../sdcm
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF -> ../../sdcn
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F -> ../../sdco
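
With ten disks the missing link is easy to overlook in the listing. As a sketch, the same check can be reduced to printing only the paths that no longer exist; the helper name missing_links is hypothetical.

```shell
# Print every path from the argument list that does not exist.
# A by-id link that has vanished belongs to the disk that was removed.
missing_links() {
    for link in "$@"; do
        # -e follows the symlink, so a dangling link is also reported.
        [ -e "$link" ] || echo "$link"
    done
}
```

Called with the same ten by-id paths as the loop above, it would print only the path ending in XJU7F.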

The list in the for command is taken from the md17 DEVICE definition in mdadm.conf.oss. The old, bad disk is the one that now reports 'No such file or directory' above: the ID ending in XJU7F. We will replace that entry with the ID of the new disk, which can be determined this way:

[root@oss31 ~]# ls -l /dev/disk/by-id/|grep sdct
lrwxrwxrwx 1 root root 10 May 18 10:56 scsi-SATA_HITACHI_HUA7250GTF402P6GS4Z4F -> ../../sdct
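
Note that grep sdct would also match partition links such as sdct1 once the disk has partitions. A slightly stricter lookup compares the link target exactly. This is a sketch only; the function name byid_for and its optional directory argument are assumptions made for illustration.

```shell
# Print the by-id link names that resolve to the given kernel device name.
# $1: kernel name such as sdct; $2: directory to search (default /dev/disk/by-id).
byid_for() {
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        # readlink prints the link target, e.g. ../../sdct
        target=$(readlink "$link")
        [ "${target##*/}" = "$1" ] && basename "$link"
    done
    return 0
}
```

On the host above, byid_for sdct would print the same scsi-SATA_... name as the ls output.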

Now we want to replace the device in both of the mdadm conf files.

DO THIS ON BOTH OSS SERVERS - both mdadm.conf.local and mdadm.conf.oss on the server where the disk failed, AND the mdadm.conf.oss on the OTHER OSS. DO NOT FAIL TO DO THIS!!!

OLD:
#/dev/md17
DEVICE /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F

NEW:
#/dev/md17
DEVICE /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GS4Z4F /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F
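
Editing three files by hand invites typos, so the substitution can be done with sed instead. This is a minimal sketch: replace_disk_id is a hypothetical helper, and it assumes the disk IDs contain only letters, digits, underscores and hyphens, so they are safe to use inside a sed expression.

```shell
# Replace one by-id name with another in each given config file,
# keeping a .bak copy of the original next to it.
replace_disk_id() {
    old=$1
    new=$2
    shift 2
    for conf in "$@"; do
        # Leave a file alone if it does not mention the old id.
        if ! grep -qF -- "$old" "$conf"; then
            echo "$conf: $old not found, skipping" >&2
            continue
        fi
        sed -i.bak "s|$old|$new|g" "$conf"
    done
}
```

Pass the old ID, the new ID, and then the paths of the mdadm conf files on that server. It still has to be run on each OSS separately, and it is worth comparing each file with its .bak copy afterwards.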

Now we will add the device to the array:

[root@oss31 ~]# mdadm /dev/md17 -a /dev/sdct
mdadm: added /dev/sdct

Next, check /proc/mdstat to confirm that the rebuild has started:

[root@oss31 ~]# grep recovery /proc/mdstat
      [>....................]  recovery =  2.2% (10960204/488386432) finish=126.8min speed=62742K/sec
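
To keep an eye on the rebuild without reading the whole line, the percentage and the estimated minutes remaining can be pulled out of it. This sketch assumes the recovery line has the format shown above; the function name recovery_progress is hypothetical.

```shell
# Print "<percent> <minutes>" for each rebuild reported in mdstat-format text.
# Reads the file given as $1 (default: /proc/mdstat).
recovery_progress() {
    sed -n 's/.*recovery = *\([0-9.]*\)%.*finish=\([0-9.]*\)min.*/\1 \2/p' "${1:-/proc/mdstat}"
}
```

For the line above it prints "2.2 126.8". When no rebuild is running it prints nothing.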

There you go.

As an alternative to the above, Fotis & Peter suggest running the following on the broken OSS:

 
# ./RunmeWhenDiskHasBeenReplaced.sh
You should now run on both OSSs: ./RunmeWhenDiskHasBeenReplaced.sh scsi-SATA_HITACHI_HUA7250GTF402P6GXK0HF scsi-SATA_HITACHI_HUA7250GTF402P6GXJT9F

And just do what it tells you.