[Beowulf] Are disk MTBF ratings at all useful?
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Sat Apr 20 16:14:33 PDT 2013
On 4/20/13 3:08 PM, "Andrew Holway" <andrew.holway at gmail.com> wrote:
>Did anyone post this yet? I think this is one of the definitive
>works on disk failure.
>On 19 April 2013 17:56, Joe Landman <landman at scalableinformatics.com> wrote:
>> On 4/19/2013 11:47 AM, mathog wrote:
>>>> My overall impression is that, when buying drives, the single piece of
>>>> manufacturer provided data that
>>>> best correlates with the actual expected life of the drive is the
>>>> length of the warranty. Even that is little
>>>> protection against a bad batch though.
>> Use AFR and warranty, ignore everything else. MTBF does not correlate
>> at all against AFR, and AFR is an objective measure.
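For reference, here is a minimal sketch of how a rated MTBF maps to a nominal AFR, next to an AFR computed from field counts. It assumes a constant failure rate (exponential lifetime model) and 24x7 power-on; the function names and the 1,000,000-hour rating are illustrative assumptions, not figures from the thread or the paper.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, assumes the drive is powered on all year


def afr_from_mtbf(mtbf_hours, hours_per_year=HOURS_PER_YEAR):
    """Nominal AFR implied by a rated MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-hours_per_year / mtbf_hours)


def observed_afr(failures, drive_years):
    """AFR measured in the field: failures per drive-year of service."""
    return failures / drive_years


if __name__ == "__main__":
    # Hypothetical datasheet rating of 1,000,000 hours MTBF
    print(f"nominal AFR: {afr_from_mtbf(1_000_000):.2%}")
    # Hypothetical fleet: 30 failures over 1000 drive-years
    print(f"observed AFR: {observed_afr(30, 1000):.2%}")
```

The gap between the two numbers is the point being made above: the nominal figure follows mechanically from the datasheet, while the observed one depends on the actual population.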
Some salient points from that article:
"The higher baseline AFR for 3 and 4 year old drives is more strongly
influenced by the underlying reliability of the particular models in that
vintage than by disk drive aging effects."
"For example, Figure 2 changes significantly when we normalize failure
rates per each drive model. Most age-related results are impacted by drive
vintages. However, in this paper, we do not show a breakdown of drives per
manufacturer, model, or vintage due to the proprietary nature of these
Yep.. Detailed failure stats are hard to come by, because they're valuable.
And of course, the interesting thing in that paper was that failure rates
are higher for the colder drives. It might be a "tolerance" issue: the
drives are optimized to work at a particular temperature (e.g. 40C), and
that's where the stackup of tolerances (mechanical and timing) works
best. As you get away from that temperature, deviations from nominal (due
to aging or wear) are more likely to cause a failure.
There's also this:
"Yang and Sun and Cole describe the processes and experimental
setup used by Quantum and Seagate to test new units and the models that
attempt to make long-term reliability predictions based on accelerated
life tests of small populations. Power-on-hours, duty cycle, temperature
are identified as the key deployment parameters that impact failure rates,
each of them having the potential to double failure rates when going from
nominal to extreme values."
This is quite interesting, because they see only a doubling in going to
extreme values. Clearly, this is not dominated by an Arrhenius-style
doubling-per-10-degrees effect. Of course, that's contradicted by the very
next sentence: "For example, Cole presents thermal de-rating models
showing that MTBF could degrade by as much as 50% when going from
operating temperatures of 30C to 40C."
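To make the arithmetic concrete, here is a rough sketch of the "double per 10 degrees" rule of thumb next to the Arrhenius acceleration factor. A 50% MTBF drop from 30C to 40C is the same thing as a doubling of the failure rate over 10 degrees. The function names are my own, and the activation energy is simply solved for so that the factor is 2 over that interval; it is not a value taken from Cole or the paper.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def rule_of_thumb_factor(delta_c):
    """Failure-rate multiplier if the rate doubles for every 10 C rise."""
    return 2.0 ** (delta_c / 10.0)


def arrhenius_factor(ea_ev, t_low_c, t_high_c):
    """Arrhenius acceleration factor going from t_low_c to t_high_c (Celsius)."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_low_k - 1.0 / t_high_k))


def activation_energy_for_factor(factor, t_low_c, t_high_c):
    """Activation energy (eV) that yields the given factor over the interval."""
    t_low_k = t_low_c + 273.15
    t_high_k = t_high_c + 273.15
    return BOLTZMANN_EV_PER_K * math.log(factor) / (1.0 / t_low_k - 1.0 / t_high_k)


if __name__ == "__main__":
    # Activation energy implied by "MTBF halves from 30C to 40C"
    ea = activation_energy_for_factor(2.0, 30.0, 40.0)
    factor = arrhenius_factor(ea, 30.0, 40.0)
    print(f"implied Ea: {ea:.3f} eV, failure-rate factor: {factor:.2f}")
    print(f"MTBF at 40C relative to 30C: {1.0 / factor:.2f}")
    # Under the rule of thumb, a 20 C rise would quadruple the failure rate
    print(f"rule-of-thumb factor for +20 C: {rule_of_thumb_factor(20.0):.1f}")
```

If temperature really followed this curve across the whole operating range, going from nominal to extreme would cost a lot more than a single doubling, which is why the two statements in the paper sit awkwardly together.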
The net of all this (and I'll bet that if you read all 21 of the
references, you'll find this): disk drive lifetime is very hard to