[Beowulf] Are disk MTBF ratings at all useful?

Sat Apr 20 13:58:01 PDT 2013

On 4/19/13 4:38 PM, "mathog" <mathog at caltech.edu> wrote:

>Joe Landman <landman at scalableinformatics.com> wrote
>
>> Use AFR and warranty, ignore everything else.  MTBF does not
>> correlate
>> at all against AFR, and AFR is an objective measure.
>
>MTBF is the inverse of the AFR times the number of hours in a year.
><snip>
>The ratings I would really like the industry to use might be called
>ef1, ef5, and ef10, where each  is the percent of disks that are
>Expected to be Functioning (defined as: works at full rated speed,
>has suffered zero data losing events, and still has
>unused blocks available) at the end of the specified number
>of years.  It would be really easy to compare disks with that system.
>With AFR etc., not so much.
>

You can get that information, but it costs a lot and would likely be bound
up in NDAs.

It is the stock in trade of a disk drive manufacturer, and you can bet
that internally, they know a LOT about failure modes, rates, etc.

(without even going to the extremes described in Crichton's "Rising Sun")

If you could show that knowing this information (in a public way) would
make a significant difference in cluster engineering, it should be
possible to get funding to do the experiment yourself.  That is, buy 100
drives and run them at different temperatures, etc.   Do this for various
kinds, etc.  (This is what Google has done internally, and they don't
publish the data because it is a strategic advantage.  I'm sure Amazon has
done the same.)

The problem is that I think that for run-of-the-mill cluster (or data
center) building, the information already available is "good enough".
That is, you figure on a refresh cycle of three years, and buy drives with
a 4 year warranty.  You ignore MTBF.
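
As a rough sketch of the arithmetic behind that rule of thumb: assuming a
constant failure rate (an exponential model, which real drives only
approximate), an MTBF in hours maps to an AFR, and an AFR maps to the
fraction of drives expected to survive a refresh cycle.  The function
names, the 8766 hours/year figure, and the sample AFR values are my own
choices for illustration, not vendor data.

```python
import math

HOURS_PER_YEAR = 8766  # 365.25 days * 24 h; assumes drives are powered on continuously


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate under an assumed constant-hazard (exponential) model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def surviving_fraction(afr, years):
    """Fraction of drives expected to still be running after `years`."""
    return (1.0 - afr) ** years


if __name__ == "__main__":
    # Illustrative AFR values only.
    for afr in (0.01, 0.03, 0.08):
        print(f"AFR {afr:.0%}: {surviving_fraction(afr, 3):.1%} left after 3 years")
```

For a three-year refresh cycle the AFR alone tells you roughly how many
spares to budget, which is why the MTBF figure adds little here.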

I also have to comment that nobody who actually uses MTBF numbers (e.g.
DoD) actually believes the numbers.  Rather, they tend to be used as a way
to compare different designs:  run design A and it gets 49,000 hours, and
run design B and it gets 20,000 hours.  Neither might actually go that many
hours, but A is clearly better than B.

To actually do a MIL-HDBK-217 analysis is tedious, and in practice,
everyone has their own schemes for derating, etc.  There's also the whole
logical and/or aspect of failure analysis, if there's any redundancy in
the system. 
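
To illustrate the logical and/or point: in a reliability block diagram,
parts that must all work combine as a product of survival probabilities,
while redundant parts fail only if every one of them fails.  A minimal
sketch, assuming independent failures (often untrue in practice, e.g.
shared power or cooling); the function names and the 0.95 figure are
hypothetical.

```python
from math import prod


def series(reliabilities):
    """All parts must work (logical AND): multiply survival probabilities."""
    return prod(reliabilities)


def parallel(reliabilities):
    """Any one part suffices (logical OR): the system fails only if every part fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    disk = 0.95  # assumed probability that a single disk survives the period
    print("striped pair (both needed):  ", series([disk, disk]))
    print("mirrored pair (either works):", parallel([disk, disk]))
```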

Consider that you have 100 parts on a board, and you get MTBF numbers for
each, and then combine them to get a PCB level MTBF number.  But did you
assume all those parts are at the same temperature, or did you analyze the
temperature distribution across the board, and assign shorter MTBFs to the
hotter parts (in accordance with the appropriate scaling laws)?
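
The effect of that assumption can be sketched numerically.  This assumes
constant failure rates that simply add across a series of parts, and an
Arrhenius acceleration factor with an activation energy of 0.7 eV.  Both
the activation energy and the part numbers below are made up for
illustration; they are not taken from MIL-HDBK-217.

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_factor(temp_c, ref_c, ea_ev=0.7):
    """Failure-rate multiplier at temp_c relative to ref_c (ea_ev is an assumed value)."""
    t = temp_c + 273.15
    t_ref = ref_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_ref - 1.0 / t))


def board_mtbf(parts, ref_c=40.0):
    """parts: list of (mtbf_hours_at_ref_c, actual_temp_c) pairs.

    Treats the board as a series system, so per-part failure rates add.
    """
    total_rate = sum(arrhenius_factor(temp, ref_c) / mtbf for mtbf, temp in parts)
    return 1.0 / total_rate


if __name__ == "__main__":
    uniform = [(5e6, 40.0)] * 100                      # every part at 40 C
    hotspot = [(5e6, 40.0)] * 90 + [(5e6, 70.0)] * 10  # ten parts running at 70 C
    print(f"uniform: {board_mtbf(uniform):,.0f} h")
    print(f"hotspot: {board_mtbf(hotspot):,.0f} h")
```

Under these assumptions a handful of hot parts dominates the board-level
number, so the answer depends heavily on which temperature model you chose.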
