[Beowulf] HD undetectable errors

Henning Fehrmann henning.fehrmann at aei.mpg.de
Fri Aug 21 06:25:32 PDT 2009


Hello,

a typical rate for unrecoverable read errors on a hard disk is
1 per 10^15 bits read.

If one fills a 100 TByte file server, the probability of losing data
is of the order of 1.
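Back-of-the-envelope, assuming 100 TByte = 10^14 bytes read exactly once
(a quick Python sketch, just to make the arithmetic explicit):

import math

bits_read = 1e14 * 8              # one full read of a 100 TByte server
rate = 1 / 1e15                   # unrecoverable errors per bit read
expected = bits_read * rate       # expected number of failed reads: 0.8
print(1 - math.exp(-expected))    # ~0.55 (Poisson), i.e. of the order of 1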
Of course, one could circumvent this problem by using RAID5 or RAID6.
Most controllers, however, do not check the parity when they read data,
and this is where the trouble begins.
I can't recall the rate for undetectable errors, but it might be a few
orders of magnitude smaller than 1 per 10^15 bits read. However, given
that one nowadays deals with a few hundred TBytes of data, this might
happen from time to time without being noticed.
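Just to illustrate the scale with a purely hypothetical figure: assume an
undetected error rate of 1 per 10^17 bits (two orders of magnitude below
the unrecoverable-read rate) and 300 TBytes that get read once per week:

import math

rate = 1 / 1e17                   # hypothetical undetected errors per bit
bits_per_pass = 300e12 * 8        # one full pass over 300 TByte
per_pass = rate * bits_per_pass   # ~0.024 expected silent errors per pass
per_year = 1 - math.exp(-per_pass * 52)
print(per_pass, per_year)         # ~0.024 per pass, ~0.71 within a year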

One could lower the rate by forcing the RAID controller to check the
parity information during reads. Are there RAID controllers which
are able to do this?
Another solution might be the use of file systems which keep additional
checksums for each block, like zfs or qfs. This even prevents data
corruption due to undetected bit flips on the bus or in the RAID
controller.
Does somebody know the size of the checksum and the rate of undetected
errors for qfs?
For zfs it is 256 bit per 512 Byte of data.
One option for computing the checksum is the fletcher2 algorithm.
Does somebody know the rate of undetectable bit flips for such a
setting?
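For illustration only, a rough sketch of a Fletcher-style running sum over
64-bit words -- this is not the actual zfs fletcher2 code, whose word
handling and number of accumulators differ:

import struct

def fletcher_like(block):
    a = b = 0
    for (word,) in struct.iter_unpack('<Q', block):  # 64-bit LE words
        a = (a + word) & 0xFFFFFFFFFFFFFFFF          # plain sum
        b = (b + a) & 0xFFFFFFFFFFFFFFFF             # sum of sums
    return a, b

data = bytearray(512)
print(fletcher_like(bytes(data)))
data[0] ^= 0x01                   # a single bit flip changes both sums
print(fletcher_like(bytes(data)))

Such running sums are presumably much weaker than a cryptographic hash
against correlated corruption patterns, which is what would determine the
undetectable-flip rate I am asking about.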

Are there any other file systems doing block-wise checksumming?


Thank you,
Henning Fehrmann
