[Beowulf] Surviving a double disk failure

Joe Landman landman at scalableinformatics.com
Fri Apr 10 10:42:27 PDT 2009


Michael Will wrote:
> 
> 
> raid6 is also new code with new bugs that can lead to data loss as well,
> regardless of its nice 'can survive two drive failures' feature. I have
> seen it happen.

All code (anywhere) can have bugs.  Arguing that the raid6 module has bugs 
is a non-sequitur.  It is well tested, and in use at a large and growing 
number of sites.

Raid6 is indeed younger than the raid5 code in the kernel.  As the raid6 
kernel code was derived from the raid5 code ...

I do agree that bugs can take down your storage, and that bad adapters, 
bad code, or bad drivers can (and do) result in damaged data.  That is 
why frequent backups are so important.  Raid is not a backup (a favorite 
expression of mine).

This said, raid6 buys you a bit more time to solve your problem than 
raid5 does.  The Google paper from 2 years ago notes that a second drive 
failure was well correlated with the first drive failure within 1000 
seconds, e.g. during the rebuild.  If that second failure occurs in a 
raid5 system, you are (largely) toast, unless you go to the 
Herculean lengths that Stuart went to.  Even then, you aren't 
guaranteed to get anything back.

The point being, raid6 may not be perfect, but it can likely stop a bad 
day from going really pear-shaped.  The statistics are against you 
surviving a failure, rather strongly, for RAID5 with large disk drives.



-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


