Disk reliability (Was: Node cloning)

Mark Hahn hahn at coffee.psychology.mcmaster.ca
Sun May 27 09:23:02 PDT 2001


> > > You can try using hdparm to turn the DMA off.  Of course, it does slow
> > > down data transfer rates considerably.
> > 
> > As Mark said, BadCRC only means that the transfer was retried.  If a few
> > BadCRC messages are the only problem, I would not turn off DMA.
> 
> What size of CRCs are being used?  If it's a 32-bit CRC and the errors
> involved are likely to involve several bits, I think your chances of
> having an uncaught data error are only four billion to one.  Four
> billion microseconds is about seventy minutes, four billion milliseconds
> is about a month and a half, and four billion seconds is about 125
> years.

hmm, I'll admit I never actually looked at the details.
the CRC is 16b (not really surprising, since ATA is that wide):
G(X) = X^16 + X^12 + X^5 + 1.
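
For anyone who wants to poke at this, here is a minimal bitwise sketch of a 16-bit CRC using the generator usually written x^16 + x^12 + x^5 + 1 (0x1021 with the x^16 term implicit). The initial value and the MSB-first, byte-at-a-time ordering are assumptions chosen to match the common CRC-16/XMODEM and CCITT-FALSE variants; the actual ATA interface may seed and order the data differently, so treat this as an illustration of the polynomial, not of the drive's exact computation.

```python
POLY = 0x1021  # x^16 + x^12 + x^5 + 1; the x^16 term is implicit in a 16-bit register


def crc16(data: bytes, init: int = 0x0000) -> int:
    """Bitwise, MSB-first CRC-16 over `data` (init value is an assumption)."""
    crc = init
    for byte in data:
        crc ^= byte << 8                 # fold the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:             # top bit set: shift and subtract the generator
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    block = bytes(range(256)) * 16       # a hypothetical 4K transfer
    good = crc16(block)
    flipped = bytearray(block)
    flipped[100] ^= 0x04                 # flip a single bit in transit
    print(hex(good), hex(crc16(bytes(flipped))))
```

A single flipped bit always changes the result, which is why the receiver sees a mismatch and the transfer is retried.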

so I think your point was to be less blasé about badCRC reports,
and you're certainly right.  hmm, so the chance of undetected errors
depends on transfers/second, right?  so figuring a worst-case ATA100
and nothing but 4K transfers, we'd see something like 20K t/s.
hmm, how do you go from those numbers to mean time to undetected failure?

I think your back-of-envelope numbers were assuming 1 transfer per us,
right?  so with 16b CRC, you'd expect an uncaught error in 64K/20K=3 s.
but is that assuming some particular distribution of errors?
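
One way to make the estimate explicit is the rough sketch below. It assumes that a corrupted transfer looks like random garbage to the checker, so an n-bit CRC misses it with probability 2^-n. That assumption is pessimistic for short error patterns, since a 16-bit CRC catches every single-bit error and every burst of 16 bits or fewer. The function name and the corrupt_fraction parameter are hypothetical, introduced only for this illustration; the real fraction of corrupted transfers is whatever the BadCRC log rate suggests.

```python
def mean_time_to_undetected(crc_bits: int,
                            transfers_per_s: float,
                            corrupt_fraction: float) -> float:
    """Mean seconds between undetected errors under the random-garbage assumption."""
    miss_prob = 2.0 ** -crc_bits                       # chance a corrupted transfer passes the CRC
    undetected_per_s = transfers_per_s * corrupt_fraction * miss_prob
    return 1.0 / undetected_per_s


if __name__ == "__main__":
    # Worst case from the text: 20K transfers/s and every one of them corrupted.
    print(mean_time_to_undetected(16, 20_000, 1.0))    # seconds
    # Hypothetical: one corrupted transfer per million.
    print(mean_time_to_undetected(16, 20_000, 1e-6) / 86_400)  # days
```

So the 3 s figure only holds if every transfer is corrupted; the time scales inversely with the fraction of transfers that actually go bad.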

thanks, mark hahn.





