Disk reliability (Was: Node cloning)
Mark Hahn
hahn at coffee.psychology.mcmaster.ca
Sun May 27 09:23:02 PDT 2001
> > > You can try using hdparm to turn the DMA off. Of course, it does slow
> > > down data transfer rates considerably.
> >
> > As Mark said, BadCRC only means that the transfer was retried. If a few
> > BadCRC messages are the only problem, I would not turn off DMA.
>
> What size of CRCs are being used? If it's a 32-bit CRC and the errors
> involved are likely to involve several bits, I think your chances of
> having an uncaught data error are only four billion to one. Four
> billion microseconds is about seventy minutes, four billion milliseconds
> is about a month and a half, and four billion seconds is about 125
> years.
hmm, I'll admit I never actually looked at the details.
the CRC is 16b (not really surprising, since ATA is that wide):
G(X) = X^16 + X^12 + X^5 + 1.
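for anyone who wants to poke at that polynomial, here is a minimal bitwise sketch of a 16-bit CRC with generator X^16 + X^12 + X^5 + 1 (0x1021 once the implicit X^16 term is dropped). the function name, the zero seed and the MSB-first byte ordering are my assumptions for illustration; the drive interface computes its CRC in hardware over 16-bit words, and this does not try to reproduce its seed or bit ordering.

```python
def crc16(data: bytes, poly: int = 0x1021, init: int = 0x0000) -> int:
    """Bitwise MSB-first CRC-16 over a byte string.

    poly=0x1021 encodes X^16 + X^12 + X^5 + 1 (the X^16 term is implicit).
    init is the register seed; 0x0000 is an assumption, not the ATA value.
    """
    crc = init
    for byte in data:
        crc ^= byte << 8              # feed the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:          # top bit set: shift and subtract the polynomial
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


payload = b"123456789"
# flip one bit of the first byte to mimic a single-bit transfer error
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]

print(hex(crc16(payload)))
print(crc16(payload) != crc16(corrupted))   # a single-bit error always changes the CRC
```

the point of the example is only that the check value is 16 bits wide, so there are 65536 possible values for a corrupted transfer to land on.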
so I think your point was to be less blasé about BadCRC reports,
and you're certainly right. hmm, so the chance of undetected errors
depends on transfers/second, right? so figuring a worst-case ATA100
and nothing but 4K transfers, we'd see something like 20K t/s.
hmm, how do you go from those numbers to mean time to undetected failure?
I think your back-of-envelope numbers were assuming 1 transfer per us,
right? so with 16b CRC, and if every single transfer were corrupted,
you'd expect an uncaught error about every 64K/20K = ~3 s.
but is that assuming some particular distribution of errors?
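a rough sketch of that arithmetic, in case it helps. it assumes that a corrupted transfer slips past an n-bit CRC with probability about 2^-n (reasonable only for random multi-bit damage, not for short bursts or single-bit flips, which a CRC catches), and that errors are independent. the function name and the `corrupt_fraction` parameter are made up for illustration; the one-in-a-million value is a placeholder, not a measurement.

```python
def mean_time_to_undetected_error(transfers_per_s: float,
                                  crc_bits: int,
                                  corrupt_fraction: float) -> float:
    """Mean seconds between corrupted transfers that the CRC fails to flag.

    corrupt_fraction is the (assumed) fraction of transfers that arrive damaged.
    The 2**-crc_bits miss probability assumes random multi-bit corruption.
    """
    miss_probability = 2.0 ** -crc_bits
    undetected_per_s = transfers_per_s * corrupt_fraction * miss_probability
    return 1.0 / undetected_per_s


# worst case from the text: 20K transfers/s, 16-bit CRC, every transfer corrupted
worst_case = mean_time_to_undetected_error(20_000, 16, 1.0)

# hypothetical: only one transfer in a million is corrupted
one_in_a_million = mean_time_to_undetected_error(20_000, 16, 1e-6)

print(f"every transfer bad:   {worst_case:.2f} s")
print(f"one bad in a million: {one_in_a_million / 86400:.1f} days")
```

so the 3 s figure is the floor you hit only when every transfer is bad; the real number scales inversely with how often BadCRC actually shows up in the logs.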
thanks, mark hahn.