mdadm RAID 1 disk failure recovery

1- Identify the disk that failed (emails were sent in my case). mdadm monitoring sent this:

This is an automatically generated mail message from mdadm running on aapsan01.billingconsultants.local

A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sda3.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[0](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0]
      1928860160 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>

The underscore marks the failed or missing slot: [_U] means the array's first member (slot 0) is down, [U_] means the second (slot 1). The slot positions follow the role numbers shown in brackets after each partition (sda3[0], sdb3[1]), not the order the devices happen to be listed in. Adding to the confusion, a SMART email reported a pile of errors on sda:

This email was generated by the smartd daemon running on:

   host name: aapsan01.billingconsultants.local
  DNS domain: [Unknown]
  NIS domain: (none)

The following warning/error was logged by the smartd daemon:

Device: /dev/sda, 498 Currently unreadable (pending) sectors

For details see host's SYSLOG (default: /var/log/messages).
You can also use the smartctl utility for further investigation.
No additional email messages about this problem will be sent.

Next, you need to locate the drive's serial number so you pull the right physical disk. Run lshal, redirect the output to a log file you can search, and look for "sda"; that block will list the serial number:

lshal > /mnt/apvdbs03/public/debug/lshal2.log

udi = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038'
  volume.ignore = true (bool)
  block.storage_device = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038' (string)
  info.udi = '/org/freedesktop/Hal/devices/storage_serial_SATA_WDC_WD20EADS_00_WD_WCAVY1249038' (string)
  storage.partitioning_scheme = 'mbr' (string)
  storage.removable.media_size = 2000398934016 (0x1d1c1116000) (uint64)
  storage.requires_eject = false (bool)
  storage.hotpluggable = false (bool)
  info.capabilities = {'storage', 'block'} (string list)
  info.category = 'storage' (string)
  info.product = 'WDC WD20EADS-00S' (string)
  info.vendor = 'ATA' (string)
  storage.size = 2000398934016 (0x1d1c1116000) (uint64)
  storage.removable = false (bool)
  storage.removable.media_available = true (bool)
  storage.physical_device = '/org/freedesktop/Hal/devices/pci_8086_27c0' (string)
  storage.lun = 0 (0x0) (int)
  storage.firmware_version = '01.0' (string)
  storage.serial = 'SATA_WDC_WD20EADS-00_WD-WCAVY1249038' (string)
  storage.vendor = 'ATA' (string)
  storage.model = 'WDC WD20EADS-00S' (string)
  storage.drive_type = 'disk' (string)
  storage.automount_enabled_hint = true (bool)
  storage.media_check_enabled = false (bool)
  storage.no_partitions_hint = false (bool)
  storage.bus = 'pci' (string)
  block.is_volume = false (bool)
  block.minor = 0 (0x0) (int)
  block.major = 8 (0x8) (int)
  block.device = '/dev/sda' (string)
  linux.hotplug_type = 3 (0x3) (int)
  info.parent = '/org/freedesktop/Hal/devices/pci_8086_27c0_scsi_host_0_scsi_device_lun0' (string)
  linux.sysfs_path_device = '/sys/block/sda' (string)
  linux.sysfs_path = '/sys/block/sda' (string)
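If lshal is not available (newer distros have dropped HAL), the serial number and drive health can be checked another way. A minimal sketch, assuming smartmontools and hdparm are installed and the suspect drive is /dev/sda; adjust device and array names to match your system:

# print model and serial number so you can match the sticker on the physical drive
smartctl -i /dev/sda

# overall SMART health self-assessment for the suspect drive
smartctl -H /dev/sda

# hdparm reports the serial number as well
hdparm -I /dev/sda | grep -i 'serial number'

# confirm which member mdadm considers faulty in a given array
mdadm --detail /dev/md0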
Now do a cat /proc/mdstat and mark all the sda instances as failed (this must be done before the disk can be removed):

[root@aapsan01 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]
md0 : active raid1 sdb3[1] sda3[2](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0]
      1928860160 blocks [2/2] [UU]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>
[root@aapsan01 ~]#

md1's and md3's sda instances still need to be marked (F) for failed:

[root@aapsan01 ~]# mdadm --fail /dev/md1 /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md1
[root@aapsan01 ~]# mdadm --fail /dev/md3 /dev/sda5
mdadm: set /dev/sda5 faulty in /dev/md3

Let's see if that worked:

[root@aapsan01 ~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdb1[1] sda1[2](F)
      104320 blocks [2/1] [_U]
md0 : active raid1 sdb3[1] sda3[2](F)
      4064320 blocks [2/1] [_U]
md3 : active raid1 sdb5[1] sda5[0](F)
      1928860160 blocks [2/1] [_U]
md2 : active raid1 sdb2[1] sda2[2](F)
      20482752 blocks [2/1] [_U]
unused devices: <none>

Now let's remove the sda instances from the arrays. (In my case sda was the good one and I should have removed sdb; that mistake put me back a few hours before I figured it out. You can avoid it by paying special attention to where the _ underscore sits in the cat /proc/mdstat output and to the serial-number info.)

[root@aapsan01 ~]# mdadm --remove /dev/md0 /dev/sda3
mdadm: hot removed /dev/sda3
[root@aapsan01 ~]# mdadm --remove /dev/md1 /dev/sda1
mdadm: hot removed /dev/sda1
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md2 /dev/sda2
mdadm: hot removed /dev/sda2
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot remove failed for /dev/sda5: Device or resource busy
[root@aapsan01 ~]# mdadm --remove /dev/md3 /dev/sda5
mdadm: hot removed /dev/sda5
[root@aapsan01 ~]#

Notice we had to try a few times to get sda5 out; sometimes it takes several minutes.

- Next, take out the bad drive and install the new one (it needs to be identical or larger).
- I had to use SystemRescueCd to get to a prompt and do the recovery; you may not need to.
- Run fdisk /dev/sdy (y = whatever the new disk is) and follow the prompts to delete any existing partitions.
- Transfer the partition scheme from the surviving disk to the new one:
  sfdisk -d /dev/sdx > mbr_sdx.txt
  sfdisk /dev/sdy < mbr_sdx.txt
- Add the new disk's partitions to the RAID arrays with: mdadm /dev/mdn -a /dev/sdyX (n = the array number; X = the matching partition number). See the condensed sketch after this list.
- The above command took >5 hours on a 2 TB drive. Check status with cat /proc/mdstat.
- Reboot when done.
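Putting the replacement steps together, here is a condensed sketch using this box's layout. It assumes the new disk shows up as /dev/sda again and that /dev/sdb is the surviving member; verify the device names with fdisk -l before running anything, because getting them backwards will clobber the good copy:

# copy the partition table from the surviving disk to the new disk
sfdisk -d /dev/sdb > mbr_sdb.txt
sfdisk /dev/sda < mbr_sdb.txt

# re-add each partition to its array (pairing taken from /proc/mdstat above)
mdadm /dev/md0 -a /dev/sda3
mdadm /dev/md1 -a /dev/sda1
mdadm /dev/md2 -a /dev/sda2
mdadm /dev/md3 -a /dev/sda5

# watch the rebuild; it can take many hours on a 2 TB drive
watch -n 60 cat /proc/mdstat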