Disk reliability (Was: Node cloning)
kragen at pobox.com
Fri Jun 22 22:51:13 PDT 2001
Mark Hahn <hahn at coffee.psychology.mcmaster.ca> writes:
> > What size of CRCs are being used? If it's a 32-bit CRC and the errors
> > involved are likely to involve several bits, I think the chances of
> > any given data error going uncaught are only one in four billion. Four
> > billion microseconds is about seventy minutes, four billion milliseconds
> > is about a month and a half, and four billion seconds is about 125
> > years.
>
> hmm, I'll admit I never actually looked at the details.
> the CRC is 16b (not really surprising, since ATA is that wide):
> G(X) = X^16 + X^12 + X^5 + 1.
>
> so I think your point was to be less blase' about badCRC reports,
> and you're certainly right. hmm, so the chance of undetected errors
> depends on transfers/second, right? so figuring a worst-case ATA100
> and nothing but 4K transfers, we'd see something like 20K t/s.
> hmm, how do you go from those numbers to mean time to undetected failure?

Well, you need to know what the mean time to detected failure is.

> I think your back-of-envelope numbers were assuming 1 transfer per us,
> right? so with 16b CRC, you'd expect an uncaught error in 64K/20K=3 s.
> but is that assuming some particular distribution of errors?

The three-second figure would be roughly correct if every transfer had
a many-bit error. In general, you'd expect one many-bit error out of
every 64K (2^16) to go undetected. (I think the CRC will detect
all one-bit errors, although I can't remember for certain.)
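
To make the arithmetic above concrete, here is a rough Python sketch. It assumes a plain bitwise CRC-16 with the generator X^16 + X^12 + X^5 + 1 (0x1021) and a zero starting value; the seed and word ordering a real ATA interface uses may well differ. The function names are made up for illustration, and the "one in 2^16" escape rate is the same back-of-envelope assumption as in the text, not a measured figure.

```python
POLY = 0x1021  # X^16 + X^12 + X^5 + 1, with the X^16 term implicit


def crc16(data, crc=0x0000):
    """Bitwise, MSB-first CRC-16. The zero seed is an assumption."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def detects_all_single_bit_errors(data):
    """Flip each bit of data in turn and check that the CRC changes."""
    good = crc16(data)
    buf = bytearray(data)
    for i in range(len(buf) * 8):
        buf[i // 8] ^= 1 << (i % 8)      # inject a one-bit error
        changed = crc16(buf) != good
        buf[i // 8] ^= 1 << (i % 8)      # undo it
        if not changed:
            return False
    return True


def seconds_to_undetected_error(transfers_per_second, bad_fraction, crc_bits=16):
    """Mean seconds until a many-bit error slips past the CRC.

    bad_fraction is the fraction of transfers that suffer a many-bit
    error; each such error is assumed to escape with probability
    2**-crc_bits.
    """
    return 2 ** crc_bits / (transfers_per_second * bad_fraction)


if __name__ == "__main__":
    print(detects_all_single_bit_errors(bytes(64)))
    print(seconds_to_undetected_error(20000, 1.0))
```

With 20K transfers per second and every transfer corrupted (bad_fraction of 1.0), the last function reproduces the roughly three-second figure; a more realistic bad_fraction stretches it proportionally.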

If all your observed failures are one-bit errors, you could square
their per-transfer frequency to get a rough estimate of the frequency
of two-bit errors, assuming the bit errors are independent. I
think "two" is a big enough number to be "many", but I'm not sure.
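
As a rough sketch of that squaring estimate in Python: the function names and the example counts are hypothetical, and the calculation assumes bit errors are independent, that a two-bit error counts as "many", and that such an error escapes a 16-bit CRC one time in 2^16.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def two_bit_error_probability(one_bit_errors, transfers):
    """Estimate the per-transfer two-bit error probability by squaring
    the observed per-transfer one-bit error frequency."""
    p_one = one_bit_errors / transfers
    return p_one ** 2


def years_to_undetected_error(one_bit_errors, transfers,
                              transfers_per_second, crc_bits=16):
    """Mean years until a two-bit error gets past the CRC undetected."""
    p_two = two_bit_error_probability(one_bit_errors, transfers)
    if p_two == 0:
        return float("inf")  # no observed errors, so no estimate
    seconds = 2 ** crc_bits / (p_two * transfers_per_second)
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Made-up example: 1 reported CRC error per million transfers,
    # at the worst-case 20K transfers per second.
    print(years_to_undetected_error(1, 1_000_000, 20_000))
```

The estimate is very sensitive to the observed error rate, since that rate is squared: ten times as many badCRC reports means a hundred times shorter expected time to an undetected error.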