[Beowulf] MD check/scrub

Tue Nov 13 10:03:22 PST 2007

Leif Nixon wrote:
> Reconstruction. With raid 6, you can recover from single-disk
> corruption (As opposed to *failures*, where you get read errors from a
> disk. Raid 6 can handle two simultaneous disk *failures*.).
> 
> See section 4 in:
> 
> http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
> 

I just read it.

> Just recalculating the parity blocks does give you a consistent raid
> stripe, but destroys your data (unless it actually was one of the
> parity blocks that was corrupted).

Er, that's not how I read it at all.  To quote:

  In the case of data drive corruption, once the faulty drive has been 
identified, recover using the P drive in the same way as a one-disk erasure 
failure.
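
For anyone who wants to see that step spelled out, here is a minimal
Python sketch of the locate-and-repair idea from section 4, for one byte
position of a stripe. It assumes the paper's GF(2^8) field (polynomial
0x11d, generator 2); the function names are made up for illustration and
this is not the MD driver's code. A real scrub would also want the
located drive to agree across the whole stripe before touching anything.

```python
# GF(2^8) log/antilog tables, polynomial 0x11d, generator g = 2
# (assumed to match the field used in the raid6.pdf paper).
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 510):
    EXP[i] = EXP[i - 255]   # doubled table avoids a modulo in gf_mul


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Return (P, Q) for one byte position across the data drives."""
    p = q = 0
    for z, d in enumerate(data):
        p ^= d                      # P is plain XOR parity
        q ^= gf_mul(EXP[z], d)      # Q weights drive z by g**z
    return p, q


def scrub_byte(data, p, q):
    """Check one byte position; repair data in place if one data drive is bad.

    Returns "ok", "P", "Q", ("data", z) or "unrecoverable".
    """
    p_new, q_new = syndromes(data)
    p_star, q_star = p ^ p_new, q ^ q_new
    if p_star == 0 and q_star == 0:
        return "ok"
    if q_star == 0:
        return "P"                  # only P disagrees: P block is corrupt
    if p_star == 0:
        return "Q"                  # only Q disagrees: Q block is corrupt
    # Both disagree: Q*/P* = g**z identifies the corrupt data drive z.
    z = (LOG[q_star] - LOG[p_star]) % 255
    if z >= len(data):
        return "unrecoverable"      # more than one drive is wrong
    data[z] ^= p_star               # same fix as a one-disk erasure using P
    return ("data", z)
```

Calling scrub_byte() on a stripe where a single data byte was flipped
returns the index of the corrupt drive and restores the byte, which is
the point: the data is recovered, not overwritten by fresh parity.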

So you want to catch these single-disk corruptions (data or parity) as soon
as possible so they don't accumulate.  In general, if you have the redundancy
at the software RAID level, it seems best not to push too hard on the
individual drive.  Don't retry excessively (and depend on the per-block
checksums) or allow long timeouts.  As soon as the error hits, do a write (to
remap the block).  After all, do you trust a drive to read the sector on the
10th try more than you trust your parity calculations?  If the drive's error
rate gets too high, drop the drive like a hot potato and scream bloody murder
so the admin feeds you a disk ASAP.
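
A rough Python sketch of that policy, just to make it concrete. The
class, the callbacks and the threshold of 10 are all hypothetical; this
is an illustration of the idea, not how MD implements it.

```python
class DriveErrorTracker:
    """Count read errors per drive and fail a drive that has too many."""

    def __init__(self, max_errors=10):      # threshold is an arbitrary guess
        self.max_errors = max_errors
        self.errors = {}                    # drive name -> error count
        self.failed = set()                 # drives kicked out of the array

    def on_read_error(self, drive, rebuild_block, write_block):
        """Handle one failed read without retrying the drive.

        rebuild_block() returns the block reconstructed from the other
        drives; write_block(data) rewrites the sector so the drive gets
        a chance to remap it. Both are hypothetical callbacks.
        """
        data = rebuild_block()              # trust parity, not a 10th retry
        self.errors[drive] = self.errors.get(drive, 0) + 1
        if self.errors[drive] > self.max_errors:
            self.failed.add(drive)          # drop it and alert the admin
        else:
            write_block(data)               # rewrite to trigger a remap
        return data
```

The caller gets good data back either way; the only question the
tracker answers is whether the drive is still worth writing to.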