[Beowulf] HD undetectable errors

Tue Aug 25 13:37:57 PDT 2009

Not an expert on this.... some thoughts below.
On Fri, Aug 21, 2009 at 03:25:32PM +0200, Henning Fehrmann wrote:
> Hello,
> 
> a typical rate for data not recovered in a read operation on a HD is
> 1 per 10^15 bit reads.
> 
> If one fills a 100TByte file server the probability of losing data
> is of the order of 1.
> Of course, one could circumvent this problem by using RAID5 or RAID6.
> Most controllers do not check the parity when they read data, and
> here the trouble begins.

> I can't recall the rate for undetectable errors but this might be a few
> orders of magnitude smaller than 1 per 10^15 bit reads. However, given
> the fact that one deals nowadays with a few hundred TBytes of data this
> might happen from time to time without being realized.
> 
> One could lower the rate by forcing the RAID controller to check the
> parity information in a read process. Are there RAID controllers which
> are able to perform this?
> Another solution might be the usage of file systems which have additional
> checksums for the blocks like zfs or qfs. This even prevents data
> corruption due to undetected bit flips on the bus or the RAID
> controller.
> Does somebody know the size of the checksum and the rate of undetected
> errors for qfs?
> For zfs it is 256 bit per 512Byte data.
> One option is the fletcher2 algorithm to compute the checksum.
> Does somebody know the rate of undetectable bit flips for such a
> setting?
> 
> Are there any other file systems doing block-wise checksumming?

I do not think you have the statistics correct, but the issue is very real.
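
For reference, the arithmetic behind the quoted "order of 1" estimate can be
sketched as below. This takes the quoted figures at face value and assumes
decimal terabytes, one full read of the server, and independent errors (a
Poisson approximation); none of those assumptions come from a data sheet.

```python
import math

def expected_read_errors(capacity_bytes: float, errors_per_bit: float) -> float:
    """Expected number of unrecovered bit reads when every bit is read once."""
    return capacity_bytes * 8 * errors_per_bit

def prob_at_least_one(expected: float) -> float:
    """Poisson approximation: P(at least one error) = 1 - exp(-expected)."""
    return 1.0 - math.exp(-expected)

capacity = 100e12   # 100 TByte, decimal terabytes (assumption)
rate = 1e-15        # 1 unrecovered error per 10^15 bits read (quoted figure)

expected = expected_read_errors(capacity, rate)
p_any = prob_at_least_one(expected)
print(f"expected errors per full read: {expected:.2f}")
print(f"P(at least one error):         {p_any:.2f}")
```

Under these assumptions one full read gives an expected count a little under
one, and a bit better than even odds of hitting at least one unrecovered read.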

There are many archival and site policies that add their own checksums
and error recovery codes to their archives because of the value or
sensitivity of the data.

All disks I know of have a CRC/ECC code on the media that is checked
at read time by hardware.  Seagate quotes an error rate of one 512-byte
sector in 10^16 reads.  The RAID, however, cannot recheck its parity
without re-reading all the spindles and recomputing and checking the
parity.  That is slow, but it could be done.

However, adding the extra read does not solve the issue, at two levels:

  * Most RAID devices are designed to react to the disk's reported errors.
	The 10^16 number is a value for undetected and unreported errors,
	so a RAID will not have its redundancy mechanism triggered.

  * Most RAID designs would not be able to recover from an all-spindle
	read and parity recompute+check that detected an error,
	i.e. the redundancy in common RAIDs cannot discover which
	of the devices presented bogus data.  It is also unknowable
	whether the error is a single bit or many bits.  In the simple
	mirror case, when the data does not match -- which is correct, A or B?
	In most more complex RAID designs the same problem exists.
	In a triple redundant mirror case a majority could rule.
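
The second point can be illustrated with a toy model. This is only a sketch:
the 4-byte "blocks", the single XOR parity and the helper names are made up
for illustration and do not model any particular controller.

```python
from collections import Counter
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks (single-parity, RAID5-style)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def parity_ok(data_blocks, parity):
    """True if the stored parity matches the parity recomputed from the data."""
    return xor_blocks(data_blocks) == parity

def majority(copies):
    """Return the value held by a strict majority of copies, or None."""
    value, count = Counter(copies).most_common(1)[0]
    return value if count > len(copies) / 2 else None

# Three data blocks plus parity, then a silent single-bit flip on "disk 1".
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
damaged = [data[0], bytes([data[1][0] ^ 0x01]) + data[1][1:], data[2]]

print(parity_ok(data, parity))      # stripe is consistent
print(parity_ok(damaged, parity))   # mismatch is seen, but not which disk lied

# Mirrors: two copies that disagree give no answer, three copies can vote.
print(majority([b"good", b"bad!"]))
print(majority([b"good", b"bad!", b"good"]))
```

With single parity, rebuilding any one member from the others produces a
self-consistent stripe, so the check says "something is wrong" without saying
where. Only when a disk reports its own error does the RAID know which member
to reconstruct.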

At single-disk read speeds of 15 MB/s, does one sector in 10^16 reads
work out to one error in 100+ years?  With an interval between failures
on the order of 100 years, other issues would seem (to me) to dominate
the reliability of a storage system.  But statistics do generate
unexpected results.
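
A quick back-of-envelope check of that interval, as a sketch only. It assumes
continuous reading at 15 MB/s (decimal megabytes) and shows two readings of
the 10^16 figure, since the answer depends heavily on whether it counts
sectors or bits. Both readings are assumptions, not vendor statements.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def years_per_error(units_per_error, bytes_per_unit, bytes_per_second):
    """Years of continuous reading needed to accumulate one expected error."""
    seconds = units_per_error * bytes_per_unit / bytes_per_second
    return seconds / SECONDS_PER_YEAR

read_rate = 15e6  # 15 MB/s, decimal megabytes (assumption)

# Reading A: one bad 512-byte sector per 10^16 sector reads.
per_sector = years_per_error(1e16, 512, read_rate)

# Reading B: one bad bit per 10^16 bits read.
per_bit = years_per_error(1e16, 1 / 8, read_rate)

print(f"per-sector reading: {per_sector:,.0f} years per error")
print(f"per-bit reading:    {per_bit:.1f} years per error")
```

Counted in sectors, a single disk would need on the order of ten thousand
years per error; counted in bits, under three years. Either way the figure is
per disk, so a large array reading in parallel shortens it proportionally.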

I do know of at least one site that has detected a single-bit data storage
error in a multi-TB RAID that went undetected by hardware and the OS.
Compressed data makes this problem even more interesting because many of
the stream tools (encryption or compression) fail "badly", and depending
on where the bits flip, a little or a LOT of data can be lost.
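
A small experiment along these lines, using Python's zlib purely as a
convenient stand-in for "a stream tool" (the sample text and helper names are
made up). It flips each bit of a compressed stream in turn and tallies what
decompression does with the damaged copy.

```python
import zlib

def flip_bit(buf: bytes, bit_index: int) -> bytes:
    """Return a copy of buf with exactly one bit inverted."""
    out = bytearray(buf)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)

def classify(original: bytes, stream: bytes) -> str:
    """'error' if decompression fails, else compare the output to the original."""
    try:
        recovered = zlib.decompress(stream)
    except zlib.error:
        return "error"
    return "identical" if recovered == original else "silent"

def survey(original: bytes) -> dict:
    """Flip every bit of the compressed stream once and count the outcomes."""
    stream = zlib.compress(original)
    counts = {"error": 0, "silent": 0, "identical": 0}
    for i in range(len(stream) * 8):
        counts[classify(original, flip_bit(stream, i))] += 1
    return counts

text = b"the quick brown fox jumps over the lazy dog. " * 40
counts = survey(text)
print(counts)
```

zlib carries its own Adler-32 check, so a damaged stream mostly shows up as a
hard error, and with this one-shot API a hard error means none of the data
comes back. Tools without an embedded check would land in the "silent" bucket
instead, which is the worse case.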

More to the point is the number of times the dice are rolled with data:
network links, PCIe, processor data paths, memory data paths, disk controller
data paths, device links, read data paths, write data paths....
Disks are the strong link in this data chain in way too many cases.

This question from above is interesting.
+ Does somebody know the size of the checksum and the rate of undetected
+ errors for qfs?

The error rate is not a simple function of qfs; it is most likely a
function of the underlying error rate in the hardware involved in qfs.
Since QFS can extend its reach from disk to tape, to/from disk cache,
to optical, and to other media, each medium needs to be understood, as
well as the statistics associated with all the hidden transfers.  With
basic error rate info for all the hardware that touches the data, some
swag on the file system error rate and undetected error rates might begin.

I think the Seagate 10^16 number is simply the hash statistics for
their Reed-Solomon ECC/CRC length and 2^512 permutations of data, not
the error rate, i.e. the quality of the code, not the error rate of
the device.

However, it does make sense to me to generate and maintain site-specific
metadata for all valuable data files, to include both detection (yes, tamper
detection too) and recovery codes.  I would extend this to all data with
the hope that any problems might be seen first on inconsequential files.
Tripwire might be a good model for starting out on this.
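
A minimal sketch of the detection half of that idea, in the spirit of
Tripwire but not modelled on its actual database format. The choice of
SHA-256, the function names and the demo file are all assumptions made for
illustration; recovery codes would need a separate tool.

```python
import hashlib
import os
import tempfile

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest

def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    bad = []
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or sha256_file(full) != expected:
            bad.append(rel)
    return sorted(bad)

def demo():
    """Build a manifest, damage one byte of one file, and verify again."""
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "data.bin")
        with open(path, "wb") as handle:
            handle.write(b"\x00" * 4096)
        manifest = build_manifest(root)
        before = verify(root, manifest)
        with open(path, "r+b") as handle:   # simulate a single flipped bit
            handle.seek(100)
            handle.write(b"\x01")
        after = verify(root, manifest)
    return before, after

print(demo())
```

The manifest only helps if it is stored somewhere other than the storage it
describes, and if it is re-verified on a schedule rather than only when
something already looks wrong.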

I should note that the three 'big' error rate problems I have worked on
in the past 25 years had their root cause in an issue not understood
or considered at design time, so empirical data from the customer was
critical.  Data sheets and design document conclusions just missed the issue.
These experiences taught me to be cautious with storage statistics.

Looming in the dark clouds is a need for owning your own data integrity.
It seems obvious to me, in the growing business of cloud computing and cloud
storage, that you need to "trust but verify" the integrity of your data.  My
thought on this is that external integrity methods will be critical in the
future.

And do remember that "parity is for farmers."

-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?