[Beowulf] Surviving a double disk failure

Joe Landman landman at scalableinformatics.com
Fri Apr 10 05:18:03 PDT 2009

Stuart Midgley wrote:

Good work Stuart!

> What are the lessons learnt? Well with software raid Linux is both your 

1) Use RAID6.  It is your friend.  RAID5 is unashamedly your enemy.

2) Scrub early, scrub often.  We cron this ~1/week on Delta-V's (sounds 
similar to your box).

3) Pay attention to any/every error.  If a disk keeps giving you errors,
toss it.
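
For (3), a rough sketch of what "pay attention" could look like in
practice: count md's "read error corrected" messages per member disk and
flag any disk that has one.  The message shape in the regex is an
assumption based on what the raid5 driver printed on kernels of this
era; other kernel versions word it differently, so check it against
your own logs.  The function names and the threshold are made up for
illustration.

```python
import re
from collections import Counter

# Assumed message shape (varies by kernel version), e.g.
#   raid5:md0: read error corrected (8 sectors at 123456 on sdc1)
CORRECTED = re.compile(r"read error corrected \(.* on (\w+)\)")


def corrected_reads(log_lines):
    """Count md 'read error corrected' messages per member device."""
    counts = Counter()
    for line in log_lines:
        match = CORRECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def disks_to_swap(log_lines, threshold=1):
    """Return devices with at least `threshold` corrected read errors."""
    counts = corrected_reads(log_lines)
    return sorted(dev for dev, n in counts.items() if n >= threshold)
```

Feed it the lines of your kernel log (for example the output of dmesg).
With the default threshold of 1, a single corrected read error is
enough to put a disk on the swap list, which matches the advice later
in this thread.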

> friend and enemy. The behaviour of md got us in this mess. When md gets 
> an error on read it recovers the data from the other disks and re-writes 
> the blocks to the failed disk hoping the disk will reallocate. You do 
> get a warning saying that md encountered a recoverable error. So you 
> think it is ok. BUT the disk still failed on read and you haven't 
> swapped it out. Some time later when another disk fails hard and you get 
> a failed read on your other dodgy disk md sees 2 failed disks. And it's 
> all over.

This is why RAID6 is your friend.  Aside from this, the scrubbing mode 
of MD is a lifesaver (it would require a later kernel; bug me offline 
if you want to try one).

This and the later versions of the md tools.  The kernel, drivers, and 
tools with your distro are *ancient* by most standards.

> My advice:  don't let Linux collude with the disk vendors and reduce 

heh ...

> your reliability. Swap any disk that gets a correctable error on read.   
> Reallocation on write is fine not on read. The disk has failed.

add to this:

4) Schedule scrubbing specifically to detect these errors.  Turn on 
error correction for the scrub to force it to try to correct errors.
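
A minimal sketch of a scheduled scrub, assuming a kernel new enough to
expose md's sysfs interface.  Writing "check" to
/sys/block/<array>/md/sync_action starts a read-only consistency pass;
writing "repair" also rewrites inconsistent stripes, which is
presumably the "error correction" setting meant above.  The function
names and the sysfs argument are hypothetical, there so the code can be
pointed at a scratch directory instead of a live array.

```python
from pathlib import Path

SYSFS_BLOCK = Path("/sys/block")


def start_scrub(array, action="check", sysfs=SYSFS_BLOCK):
    """Ask md to scrub `array` (e.g. "md0").  Needs root on a real box."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (sysfs / array / "md" / "sync_action").write_text(action + "\n")


def mismatch_count(array, sysfs=SYSFS_BLOCK):
    """Read the mismatch counter md reports for the last scrub."""
    return int((sysfs / array / "md" / "mismatch_cnt").read_text())
```

Run start_scrub("md0") from a weekly cron job, then look at
mismatch_count("md0") and the kernel log once the pass finishes.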

Glad you were able to get your data back.


Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
