Areca Raid array failed!

bzaayer · n00b · Joined Aug 11, 2013 · Messages: 5
I have been thrown into a tough situation. We have an ARC-1223-8i RAID controller with a RAID set and two RAID 6 volumes. It looks like 3 of the 6 drives failed, which has caused both volumes to fail.

I don't have much experience with Areca controllers. I don't want to do any permanent damage to the data. What should be my first step in recovering the array? Thank you for any help that can be provided.
 
Replace the failed disks, and restore from the most recent backup.
 
Did 3 drives just suddenly fail? In that case it's likely the drives aren't actually bad and the controller just thinks they are. If that is what happened, then yeah, there is probably some way to get the controller to assume those drives are good and just resync. I'd call Areca support in the morning or google around.

If the first 2 drives failed long ago and no one knew, and only now the 3rd drive failed and really is bad, then there isn't much for you to do, unless there was no backup and you're willing to spend lots and lots of money trying to recover the data.
 
Probably you are using the Areca with non-TLER drives that have a uBER of 10^-14 or something like that. That is asking for trouble. Areca = legacy RAID.
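To put a rough number on that (assuming, purely for illustration, 2 TB drives): fully reading a 6-drive set means pulling on the order of 10-12 TB, which is roughly 10^14 bits. At a uBER of 10^-14 that works out to about one expected unreadable sector over the whole pass, i.e. close to even odds of hitting at least one URE at exactly the moment the array has no redundancy left to fix it.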

Your data is not gone, however. Once you clear the bad sectors on the disks, you can reattach the detached disk and at least get a degraded array. I recommend not adding more disks, to prevent the RAID engine from starting a rebuild.
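If you do go the route of clearing bad sectors by hand, here is a minimal sketch of one way to do it. It is not Areca-specific; /dev/sdX and the sector number are placeholders you would take from the drive's own error log or a surface read test, and writing the sector destroys whatever was stored in it.

Code:
# confirm the suspect sector really is unreadable
hdparm --read-sector 123456789 /dev/sdX

# overwrite that single sector with zeros; a pending sector gets rewritten
# in place or remapped by the drive firmware
hdparm --yes-i-know-what-i-am-doing --write-sector 123456789 /dev/sdX

Check the Current Pending Sector count afterwards to confirm it actually dropped.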

Carefully determine whether you want to continue using legacy RAID. Either make sure you have a real backup or convert to a modern ZFS setup instead.

As for recovery, you should probably begin by analysing the SMART data of the failed drives. In particular, you want to know whether they have any bad sectors (Current Pending Sector) or cable defects (UDMA CRC Error Count).
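A sketch of what that looks like with smartmontools (device names and slot numbers are placeholders; drives sitting behind the Areca card usually have to be addressed through smartctl's areca device type rather than as plain /dev/sdX):

Code:
# full SMART report for a directly attached drive
smartctl -a /dev/sdX

# a drive behind the Areca controller, addressed by slot (node and slot numbering vary per setup)
smartctl -a -d areca,5 /dev/sg0

# just the two attributes mentioned above
smartctl -A /dev/sdX | egrep -i 'Current_Pending_Sector|UDMA_CRC_Error'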
 
FWIW, over at 2cpu there's a bunch of Areca array reassembly artists hanging out. They can probably give you step-by-step instructions.
 
I was able to get the array back into a degraded state by telling the RAID card to activate the failed disk on the drive that failed most recently (drive 5). I then installed 4 new drives: 2 to rebuild with and 2 as hot spares. The array contains two volumes. The first volume rebuilt correctly. The second volume then began to rebuild, and about 2 hours into the rebuild drive 5 failed again.

Then, for some unknown reason, an engineer from their office removed the failed drive 5. They then called me. When I showed up, the 1st volume (OS) was running correctly; volume 2 (data) showed as failed. I told the controller to activate failed disk 5, but unfortunately that disk now shows as free. Is there any way to get the second volume back? Thank you for any help!

Brian
 
You really want to ask this over on 2cpu now before you continue writing to these drives.

Do you have a more specific error message from disk 5? Was it a read error that couldn't be remapped (as opposed to a write error)?
 
I'm assuming they/you don't have any backups? If the data is 110% important and there aren't any backups then stop doing anything and get the disks to a data recovery firm now. If you have backups then just hose the array and start again.
 
Brian - before you do anything else to further damage your chances of recovery: post a complete log from the card, make sure you know the original specs of the arrays (physical drive order, block and stripe size), and shut down the array until you get more help. The next step, before you do ANYTHING, is to make a bit-perfect image of each drive in the array (using dd or some other mechanism; a sketch follows below), so you can always get back to where you are now (which, after all that has been done, may already be up the creek) in your recovery efforts. Finally, I will end with my customary RAID != BACKUP, in the hope that people heed this advice in the future.
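A minimal imaging sketch, assuming you can present each member disk to the OS directly (pass-through/JBOD or a plain SATA/SAS port); device names and paths are placeholders. GNU ddrescue, if you can get it, is friendlier than plain dd because it retries and keeps a map of the unreadable areas.

Code:
# plain dd: keep going past read errors, pad each unreadable block with zeros
# (a smaller bs loses less data around each bad sector, but is slower)
dd if=/dev/sdX of=/mnt/backup/disk5.img bs=64K conv=noerror,sync

# or GNU ddrescue: the map file records exactly which ranges could not be read
ddrescue /dev/sdX /mnt/backup/disk5.img /mnt/backup/disk5.map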

PS - email kevin at Areca and look for Jus either here or at forums.2cpu.com; they are your best hope at this point, other than professional (read: expensive) recovery efforts.
 
Thank you for your help. I have also posted over at 2cpu. The specific error for disk 5 was a read error.

I am in the process of imaging each drive. What a mess!
 
Right, so you are in the same stage that I found myself in lately.

Your array degraded due to some random errors and it kicked out random drives. So far so good. But one or more of the remaining drives have "pending read errors": unreadable blocks that nobody has touched lately, but now that you read the complete drive they will cause an error, further degrading the array. If you had hit these bad blocks during normal operation, the RAID would have recovered their contents from the remaining drives and then overwritten the bad blocks, which would have caused the drive to remap them elsewhere. But now it can't do that, since it cannot recreate the block contents from the other drives (there are no redundant drives left).

The good news is that this is recoverable, since you have no inconsistent writes (the filesystem went offline on a read), or at least no corruption in files that weren't being written to at the time. You will lose the contents of the bad block that caused the read error, and you can even try to find out which file it lives in.
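For completeness, and only if the volume happens to carry an ext2/3/4 filesystem, a hedged sketch of mapping a bad block back to a file. The numbers are placeholders, and getting from a member-disk sector to a filesystem block means first accounting for the RAID stripe layout and the partition offset, then dividing by the filesystem block size, so treat this as the last step once you know the affected offset on the volume itself.

Code:
# which inode owns filesystem block 1234567?
debugfs -R "icheck 1234567" /dev/sdX1

# which path does that inode (say, 98765 from the output above) belong to?
debugfs -R "ncheck 98765" /dev/sdX1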

dd-copying the drives with the dd options that replace bad blocks with zeros is the way to go, keeping note of the blocks skipped. Then tell the array to just use the &*#(*(#(# drive. You can also reuse this drive if you give it a chance to remap those sectors, but I don't know of a premade tool that reacts to a read error by writing those blocks (other than the RAID systems (*)). Copying back from a dd copy will do it, of course.
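A sketch of that "copy back from the dd copy" idea (device and file names are placeholders, and it only makes sense if the drive is otherwise healthy enough to keep using): writing the full image back means every sector gets written once, so pending sectors are either rewritten in place or remapped by the firmware.

Code:
# write the rescued image back over the same drive
dd if=/mnt/backup/disk5.img of=/dev/sdX bs=64K

# afterwards the pending count should have dropped (ideally to zero)
smartctl -A /dev/sdX | egrep -i 'Current_Pending_Sector|Reallocated_Sector'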


(*) why the idiot raid systems don't offer this to the user when they encounter a read error on an array one degradation away from failure is a mystery to me, BTW.
 
Legacy RAID engines treat disks as either good or bad; nothing in between. A disk with terabytes of intact data but one tiny sector that cannot be read is considered a failed disk from the viewpoint of the RAID engine.

Of course, that is why you should not use legacy RAID, but instead migrate to a next generation filesystem like ZFS. ZFS is virtually immune to bad sectors and all the trouble displayed in this thread.
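For reference, a minimal sketch of what that migration could look like (pool and device names are made up; in practice you would use /dev/disk/by-id names). raidz2 gives the same two-disk parity as RAID 6, and a regular scrub is what finds and repairs latent bad sectors from redundancy before several of them pile up.

Code:
# two-parity pool from six disks
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf

# read everything, verify checksums, repair anything that fails from parity
zpool scrub tank
zpool status -v tank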
 
Did this get sorted? Looks like 2cpu didn't have the usual suspects showing up either.
 