If you find that a disk has died on an OSS, follow this procedure.

First, find out which disk has gone bad:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# grep -C2 _ /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md17 : active raid6 sdcf[0] sdco[9] sdcn[8] sdcm[7] sdcl[6] sdck[5] sdcj[10](F) sdci[3] sdch[2] sdcg[1]
      3907091456 blocks level 6, 128k chunk, algorithm 2 [10/9] [UUUU_UUUUU]
        in: 24771488 reads, 38246047 writes; out: 3561687537 reads, 1769535592 writes
        2236207063 in raid5d, 131583 out of stripes, 3545221835 handle called
[root@oss31 ~]# fdisk -l /dev/sdcj
[root@oss31 ~]#
</verbatim> </blockquote>
Here we see that sdcj has failed: md marks it with =(F)= and shows the gap as =_= in =[UUUU_UUUUU]=, and =fdisk= prints nothing because the system no longer knows anything about the device.

The kernel log confirms the failure:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
(from /var/log/messages)
May 14 13:27:44 oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device
May 14 13:27:44 oss31 last message repeated 2 times
May 14 13:27:44 oss31 kernel: raid5: Disk failure on sdcj, disabling device. Operation continuing on 9 devices
May 14 13:27:44 oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device
May 14 13:27:44 oss31 kernel: raid5:md17: read error not correctable (sector 821658128 on sdcj)
</verbatim> </blockquote>
=mdadm --detail= (which needs the full =/dev/md17= path) reports the array as degraded, with slot 4 removed and the old disk listed as a faulty spare:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# mdadm --detail md17
mdadm: cannot open md17: No such file or directory
[root@oss31 ~]# mdadm --detail /dev/md17
/dev/md17:
        Version : 0.90
  Creation Time : Tue Mar 2 18:03:16 2010
     Raid Level : raid6
     Array Size : 3907091456 (3726.09 GiB 4000.86 GB)
  Used Dev Size : 488386432 (465.76 GiB 500.11 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 17
    Persistence : Superblock is persistent

  Intent Bitmap : /lustre/scratch/bmp17/bitmap

    Update Time : Tue May 18 10:19:07 2010
          State : clean, degraded
 Active Devices : 9
Working Devices : 9
 Failed Devices : 1
  Spare Devices : 0

     Chunk Size : 128K

           UUID : 22affd18:9e22b048:7d19f379:8ed3dc17
         Events : 0.419992

    Number   Major   Minor   RaidDevice State
       0      69       48        0      active sync   /dev/sdcf
       1      69       64        1      active sync   /dev/sdcg
       2      69       80        2      active sync   /dev/sdch
       3      69       96        3      active sync   /dev/sdci
       4       0        0        4      removed
       5      69      128        5      active sync   /dev/sdck
       6      69      144        6      active sync   /dev/sdcl
       7      69      160        7      active sync   /dev/sdcm
       8      69      176        8      active sync   /dev/sdcn
       9      69      192        9      active sync   /dev/sdco

      10      69      112        -      faulty spare
</verbatim> </blockquote>
Now, to make sure the system has truly disabled the disk, try to mark it as failed explicitly:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# mdadm -f /dev/md17 /dev/sdcj
mdadm: cannot find /dev/sdcj: No such file or directory
</verbatim> </blockquote>
Because the system has already disabled the device, the =/dev/sdcj= node is gone and mdadm can no longer address the disk by that name.
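The manual check of =/proc/mdstat= above can also be scripted. The following is only a sketch: the function name =failed_md_members= is made up for illustration, and it assumes the mdstat layout shown above, where a failed member carries an =(F)= suffix on the array line.

```shell
# Hypothetical helper (not part of the original procedure): print
# "<array> <device>" for every array member that md has flagged "(F)".
# Reads mdstat-formatted text from the file in $1, or /proc/mdstat by default.
failed_md_members() {
    awk '
        /^md[0-9]+ :/ {
            array = $1
            for (i = 2; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # strip the "[10](F)" suffix
                    print array, dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Called without arguments on the OSS, it would read the live =/proc/mdstat=; for the state shown above it prints =md17 sdcj=.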
Next, remove the failed device from the array, since md has not released it yet:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# mdadm /dev/md17 --remove failed
mdadm: hot removed 69:112
</verbatim> </blockquote>
Now we need to make sure that the disk is truly disabled at the SCSI level. This is done by echoing a string into =/proc/scsi/scsi=. First, however, you need to know the SCSI address of the disk, which appears in the messages log output above:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
oss31 kernel: scsi 2:0:39:0: rejecting I/O to dead device
</verbatim> </blockquote>
The address =2:0:39:0= is host, channel, id and lun, so our command looks like this:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# echo "scsi remove-single-device" 2 0 39 0 > /proc/scsi/scsi
-bash: echo: write error: No such device or address
</verbatim> </blockquote>
The write error here indicates that the kernel had already dropped the device, so we can be confident that nothing ugly should happen when we pull the disk. Now physically replace the disk with a new one.
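The step of reading the SCSI address out of the log and turning it into the =/proc/scsi/scsi= string can be sketched as below. Both function names are hypothetical, the log pattern is assumed to match the "rejecting I/O to dead device" lines quoted above, and the helper only prints the string so that you can review it before writing anything to =/proc/scsi/scsi=.

```shell
# Hypothetical helpers, for illustration only.

# Read kernel log lines on stdin and print each distinct SCSI address
# (host:channel:id:lun) reported as a dead device.
scsi_addr_from_log() {
    sed -n 's/.*scsi \([0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*:[0-9][0-9]*\): rejecting I\/O to dead device.*/\1/p' | sort -u
}

# Build the command string for /proc/scsi/scsi.
# $1: "remove" or "add"; $2: address in host:channel:id:lun form.
# Prints the string only; it does NOT write to /proc/scsi/scsi.
scsi_proc_command() {
    printf 'scsi %s-single-device %s\n' "$1" "$(printf '%s' "$2" | tr ':' ' ')"
}
```

For example, =grep "dead device" /var/log/messages | scsi_addr_from_log= would list the candidate addresses, and =scsi_proc_command remove 2:0:39:0= prints the same string that is echoed by hand above.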
When you get back to your terminal, check dmesg to see what device name the system has given the new disk:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# dmesg|tail -14
mptsas: ioc2: attaching sata device, channel 0, id 14, phy 14
  Vendor: ATA  Model: HITACHI HUA7250S  Rev: AC4A
  Type: Direct-Access  ANSI SCSI revision: 05
SCSI device sdct: 976773168 512-byte hdwr sectors (500108 MB)
sdct: Write Protect is off
sdct: Mode Sense: 73 00 00 08
SCSI device sdct: drive cache: write through
SCSI device sdct: 976773168 512-byte hdwr sectors (500108 MB)
sdct: Write Protect is off
sdct: Mode Sense: 73 00 00 08
SCSI device sdct: drive cache: write through
 sdct: unknown partition table
sd 2:0:50:0: Attached scsi disk sdct
sd 2:0:50:0: Attached scsi generic sg93 type 0
</verbatim> </blockquote>
If there is no indication from the system that a disk was inserted, run this command (substituting the host, channel, id and lun of the new disk's slot), then look at dmesg again:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
echo "scsi add-single-device" 1 0 33 0 > /proc/scsi/scsi
</verbatim> </blockquote>
So now we know that the new disk is =/dev/sdct=. In the mdadm configuration we do not reference the disks by their sd names, but by the unique disk IDs provided by the disk manufacturer. Now we need to add the new disk to the original RAID array, and also update =/etc/mdadm.conf.oss= and =/etc/mdadm.conf.local=.
The best way to start is to check which of the array's configured disk IDs still resolve to a device:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# for i in \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF \
    /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F; do echo $i; ls -l $i; done
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF -> ../../sdcf
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF -> ../../sdcg
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF -> ../../sdch
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF -> ../../sdci
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F
ls: /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F: No such file or directory
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF -> ../../sdck
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF -> ../../sdcl
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF -> ../../sdcm
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF -> ../../sdcn
/dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F
lrwxrwxrwx 1 root root 10 May 10 15:59 /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F -> ../../sdco
</verbatim> </blockquote>
The list in the =for= command is taken from the md17 DEVICE definition in =mdadm.conf.oss=. The old, bad disk is the one that reports 'No such file or directory' above: the ID ending in XJU7F. We will replace that ID with the ID of the new disk, which can be determined this way:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# ls -l /dev/disk/by-id/|grep sdct
lrwxrwxrwx 1 root root 10 May 18 10:56 scsi-SATA_HITACHI_HUA7250GTF402P6GS4Z4F -> ../../sdct
</verbatim> </blockquote>
Now we want to replace the device in both of the mdadm conf files. <span style="color: #dc143c;">DO THIS ON BOTH OSS SERVERS: both mdadm.conf.local and mdadm.conf.oss on the server the disk failed on, AND the mdadm.conf.oss on the OTHER OSS. DO NOT FAIL TO DO THIS!!!
</span> <blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
OLD:
#/dev/md17
DEVICE /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJU7F /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F

NEW:
#/dev/md17
DEVICE /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJYZF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6G6M6XF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK9MF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJEGF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GS4Z4F /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK5LF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJ6BF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJZPF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXK0WF /dev/disk/by-id/scsi-SATA_HITACHI_HUA7250GTF402P6GXJX2F
</verbatim> </blockquote>
Another way of doing this is with =sed=, where =OLD_DISK_ID= and =NEW_DISK_ID= stand for the real IDs:
   * On the host where the disk failed: <verbatim>
$ grep OLD_DISK_ID /etc/mdadm.conf.*
$ sed -i 's/OLD_DISK_ID/NEW_DISK_ID/g' /etc/mdadm.conf.local
$ sed -i 's/OLD_DISK_ID/NEW_DISK_ID/g' /etc/mdadm.conf.oss
</verbatim>
   * On the partner host of the one where it failed: <verbatim>
$ grep OLD_DISK_ID /etc/mdadm.conf.*
$ sed -i 's/OLD_DISK_ID/NEW_DISK_ID/g' /etc/mdadm.conf.oss
</verbatim>
Make sure there is only one entry for that ID in each file!
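The "only one entry per file" check can be done with a small helper like the one below. This is a sketch only: =count_disk_id= is a made-up name, and it relies on =grep -o=, which GNU and BSD grep provide but POSIX does not require.

```shell
# Hypothetical helper: count how many times a disk ID occurs in a file.
# $1: disk ID string (matched literally); $2: path to the config file.
# Counts occurrences rather than lines, because a DEVICE line holds many IDs.
count_disk_id() {
    grep -o -F -- "$1" "$2" | wc -l | tr -d ' '
}
```

After the edit, running it with the new ID against each of the conf files should print =1=, and running it with the old ID should print =0=.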
For example, if the failed device was =/dev/sdaj= in =oss11=, the OLD_ID was =scsi-SATA_HITACHI_HUA7250GTF402P6GXJTDF= and the NEW_ID was =scsi-SATA_HITACHI_HUA7250GTF402P6G6W9KF=, we would back up the files and then run =sed= on both hosts:<verbatim>
[root@oss11:~]# cp /etc/mdadm.conf.local ./bck.mdadm.conf.local
[root@oss11:~]# cp /etc/mdadm.conf.oss ./bck.mdadm.conf.oss
[root@oss11:~]# sed -i 's/scsi-SATA_HITACHI_HUA7250GTF402P6GXJTDF/scsi-SATA_HITACHI_HUA7250GTF402P6G6W9KF/g' /etc/mdadm.conf.local
[root@oss11:~]# sed -i 's/scsi-SATA_HITACHI_HUA7250GTF402P6GXJTDF/scsi-SATA_HITACHI_HUA7250GTF402P6G6W9KF/g' /etc/mdadm.conf.oss

[root@oss12:~]# cp /etc/mdadm.conf.oss ./bck.mdadm.conf.oss
[root@oss12:~]# sed -i 's/scsi-SATA_HITACHI_HUA7250GTF402P6GXJTDF/scsi-SATA_HITACHI_HUA7250GTF402P6G6W9KF/g' /etc/mdadm.conf.oss
</verbatim>
Now we add the new device to the array:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# mdadm /dev/md17 -a /dev/sdct
mdadm: added /dev/sdct
</verbatim> </blockquote>
Next, check in =/proc/mdstat= that the rebuild has started:
<blockquote style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 40px; border-width: initial; border-color: initial; border-style: none; padding: 0px"> <verbatim>
[root@oss31 ~]# grep recovery /proc/mdstat
[>....................] recovery = 2.2% (10960204/488386432) finish=126.8min speed=62742K/sec
</verbatim> </blockquote>
There you go.

As an alternative to the above, Fotis & Peter suggest running the following on the broken OSS:
<blockquote> <verbatim>
# ./RunmeWhenDiskHasBeenReplaced.sh
You should now run on both OSSs:
./RunmeWhenDiskHasBeenReplaced.sh scsi-SATA_HITACHI_HUA7250GTF402P6GXK0HF scsi-SATA_HITACHI_HUA7250GTF402P6GXJT9F
</verbatim> </blockquote>
And just do what it tells you ;-)

-- Main.JasonTemple - 2010-05-17
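The =grep recovery /proc/mdstat= check above can be condensed into a one-line status for repeated polling. This is a sketch under the assumption that the recovery line has the layout shown above; the function name =md_recovery_status= is hypothetical.

```shell
# Hypothetical helper: print "<percent done> <estimated time left>" for each
# array that is rebuilding. Reads mdstat-formatted text from the file in $1,
# or /proc/mdstat by default. Prints nothing when no recovery is running.
md_recovery_status() {
    awk '
        /recovery =/ {
            pct = ""; eta = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/) pct = $i
                if ($i ~ /^finish=/) { eta = $i; sub(/^finish=/, "", eta) }
            }
            print pct, eta
        }
    ' "${1:-/proc/mdstat}"
}
```

For the recovery line shown above it prints =2.2% 126.8min=; empty output means that no array is currently in recovery.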
Topic revision: r9 - 2011-06-24 - MiguelGila
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.