mdadm RAID 1 disk failure recovery

1- Identify the disk that failed (emails were sent in my case). mdadm monitoring sent this:

This is an automatically generated mail message from mdadm running on aapsan01.billingconsultants.local

A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sda3.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0]
      1928860160 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>

The underscore marks the failed or missing slot: [_U] means the array's first member (slot 0) is down, [U_] means the second (slot 1). The slot positions follow the role numbers shown in brackets after each partition (sda3[0], sdb3[1]), not the order the devices happen to be listed in. Adding to the confusion, a SMART email reported a pile of errors on sda:

This email was generated by the smartd daemon running on:

   host name: aapsan01.billingconsultants.local
  DNS domain: [Unknown]
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda, 498 Currently unreadable (pending) sectors

For details see host's SYSLOG (default: /var/log/messages).
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

Next, you need to locate the drive's serial number so you pull the right physical disk. Run lshal, redirect the output to a log file you can search, and look for "sda"; that block will list the serial number:

lshal > /mnt/apvdbs03/public/debug/lshal2.log

udi = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038'
  volume.ignore = true (bool)
  block.storage_device = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038' (string)
  info.udi = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038' (string)
  storage.partitioning_scheme = 'mbr' (string)
  storage.removable.media_size = 2000398934016 (0x1d1c1116000) (uint64)
  storage.requires_eject = false (bool)
  storage.hotpluggable = false (bool)
  info.capabilities = {'storage', 'block'} (string list)
  info.category = 'storage' (string)
  info.product = 'WDC WD20EADS-00S' (string)
  info.vendor = 'ATA' (string)
  storage.size = 2000398934016 (0x1d1c1116000) (uint64)
  storage.removable = false (bool)
  storage.removable.media_available = true (bool)
  storage.physical_device = '/org/freedesktop/Hal/devices/pci_8086_27c0' (string)
  storage.lun = 0 (0x0) (int)
  storage.firmware_version = '01.0' (string)
  storage.serial = 'SATA_WDC_WD20EADS-00_WD-WCAVY1249038' (string)
  storage.vendor = 'ATA' (string)
  storage.model = 'WDC WD20EADS-00S' (string)
  storage.drive_type = 'disk' (string)
  storage.automount_enabled_hint = true (bool)
  storage.media_check_enabled = false (bool)
  storage.no_partitions_hint = false (bool)
  storage.bus = 'pci' (string)
  block.is_volume = false (bool)
  block.minor = 0 (0x0) (int)
  block.major = 8 (0x8) (int)
  block.device = '/dev/sda' (string)
  linux.hotplug_type = 3 (0x3) (int)
  info.parent = '/org/freedesktop/Hal/devices/pci_8086_27c0_scsi_host_0_scsi_device_lun0' (string)
  linux.sysfs_path_device = '/sys/block/sda' (string)
  linux.sysfs_path = '/sys/block/sda' (string)
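If lshal is not available (newer distros have dropped HAL), the serial number and drive health can be checked another way. A minimal sketch, assuming smartmontools and hdparm are installed and the suspect drive is /dev/sda; adjust device and array names to match your system:

# print model and serial number so you can match the sticker on the physical drive
smartctl -i /dev/sda

# overall SMART health self-assessment for the suspect drive
smartctl -H /dev/sda

# hdparm reports the serial number as well
hdparm -I /dev/sda | grep -i 'serial number'

# confirm which member mdadm considers faulty in a given array
mdadm --detail /dev/md0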
Now do a cat /proc/mdstat and mark all the sda instances as failed (this must be done before the disk can be removed):

[root@aapsan01 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[2](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0]
      1928860160 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>
[root@aapsan01 ~]#

md1's and md3's sda instances still need to be marked (F) for failed:

[root@aapsan01 ~]# mdadm --fail /dev/md1 /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md1
[root@aapsan01 ~]# mdadm --fail /dev/md3 /dev/sda5
mdadm: set /dev/sda5 faulty in /dev/md3

Let's see if that worked:

[root@aapsan01 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[2](F)
      104320 blocks [2/1] [_U]
md0 : active raid1 sdb3[1] sda3[2](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0](F)
      1928860160 blocks [2/1] [_U]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>

Now let's remove the sda instances from the arrays. (In my case sda was the good one and I should have removed sdb; that mistake put me back a few hours before I figured it out. You can avoid it by paying special attention to where the _ underscore sits in the cat /proc/mdstat output and to the serial-number info.)

[root@aapsan01 ~]# mdadm --remove /dev/md0 /dev/sda3
mdadm: hot removed /dev/sda3
[root@aapsan01 ~]# mdadm --remove /dev/md1 /dev/sda1
mdadm: hot removed /dev/sda1
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md2 /dev/sda2
mdadm: hot removed /dev/sda2
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot removed /dev/sda5
[root@aapsan01 ~]#

Notice we had to try a few times to get sda5 out; sometimes it takes several minutes.

- Next, take out the bad drive and install the new one (it needs to be identical or larger).
- I had to use SystemRescueCd to get to a prompt and do the recovery; you may not need to.
- Run fdisk /dev/sdy (y = whatever the new disk is) and follow the prompts to delete any existing partitions.
- Transfer the partition scheme from the surviving disk to the new one:
  sfdisk -d /dev/sdx > mbr_sdx.txt
  sfdisk /dev/sdy < mbr_sdx.txt
- Add the new disk's partitions to the RAID arrays with: mdadm /dev/mdn -a /dev/sdyX (n = the array number; X = the matching partition number). See the condensed sketch after this list.
- The above command took >5 hours on a 2 TB drive. Check status with cat /proc/mdstat.
- Reboot when done.
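Putting the replacement steps together, here is a condensed sketch using this box's layout. It assumes the new disk shows up as /dev/sda again and that /dev/sdb is the surviving member; verify the device names with fdisk -l before running anything, because getting them backwards will clobber the good copy:

# copy the partition table from the surviving disk to the new disk
sfdisk -d /dev/sdb > mbr_sdb.txt
sfdisk /dev/sda < mbr_sdb.txt

# re-add each partition to its array (pairing taken from /proc/mdstat above)
mdadm /dev/md0 -a /dev/sda3
mdadm /dev/md1 -a /dev/sda1
mdadm /dev/md2 -a /dev/sda2
mdadm /dev/md3 -a /dev/sda5

# watch the rebuild; it can take many hours on a 2 TB drive
watch -n 60 cat /proc/mdstat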