[Beowulf] Re: Cooling vs HW replacement

Robert G. Brown rgb at phy.duke.edu
Sun Jan 23 22:57:16 PST 2005

On Sun, 23 Jan 2005, Greg Lindahl wrote:

> On Sun, Jan 23, 2005 at 11:30:30AM -0500, Robert G. Brown wrote:
> > So I reiterate -- MTBF for hard disks, as reported by the manufacturer,
> > is a nearly useless number.
> It is useful if you use it for what it's meant to be used for: the
> failure rate in the bottom of the bathtub. I don't know why you were
> thinking of using it for anything else, like disk lifetime, or infant
> mortality. I have found that my actual failure rates have been 2X-3X
> the manufacturer's number, but you always have to worry about dust,
> power surges, and excess heat incidents in real machine rooms.

I think >>everybody<< finds that actual failure rates are (at least)
2x-3x the mfr number, and finds that it varies wildly in time and with
environmental conditions and with plain old luck.  That's why (and what
I mean by stating that) mfr MTBF quotations are optimistic and cheery.
If you've developed Kentucky Windage for their numbers that makes them
useful to you, that's great, but you've got a LOT of experience on which
to base that correction, and can still get burned by the fact that
actual failures are (at best) not terribly uniformly distributed -- the
"lemon" phenomenon of manufacturing, also known as "the box of disks
that fell from the truck during shipping".

Otherwise, what I was basically doing is describing the bathtub (which
might, in fact, be more of a kitchen sink with a quite small flat
region, given that the testing cannot, obviously, take long enough to
define a proper tub floor). That is, we don't really know much about the
bathtub size or shape for any drive except (perhaps) for whatever we can
infer from the mfr warranty on the particular drive in question, and
even THAT is bent out of ideal shape by the actual conditions (such as
the particular case it is mounted in and how good its ventilation is and
the temperature of the ambient air and how hard it is being run).

As was pointed out by Karen (and I agree) the mfr warranty period is
perhaps a better number for most people to pay attention to than MTBF as
it is the only number that actually costs mfrs money when a disk
"prematurely" fails and the only number that does you any good if you
buy a hundred disks -- or even just one -- that turn out to be from a
"bad batch".  Being a cynic, I cannot keep from thinking of the dozens
of ways an overgood MTBF number could be "cooked" by a mfr, the near
certainty that nobody will ever do anything like a study that could
refute it if they pulled it out of thin air, and the lack of financial
incentives to make it pessimistic or even accurate.  Maybe they are all
perfectly honest and drive failure rates are really just 1%/year or
thereabouts (on the bathtub floor) and I just never noticed it, or was
unlucky, or beat the disks to death by using them in actual computers
that only rarely used the disks at all;-) With a warranty, though,
while I still care, I care less -- I still have to hassle with the
replacement but I don't have to buy the disk over again, even if it is
just one drive in 100 in a year.
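
As a rough illustration of how a quoted MTBF maps onto a yearly failure
rate on the bathtub floor, here is a minimal Python sketch.  It assumes a
constant hazard rate (the exponential model), which only holds on the
flat part of the tub.  The 1,000,000 hour MTBF and the 100-disk batch
are illustrative numbers, not figures from any particular mfr, and the
function names are made up for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, assuming the disk is powered on all year


def annual_failure_rate(mtbf_hours):
    """Probability that one disk fails within a year of continuous use.

    Assumes a constant failure rate (exponential model), i.e. the disk is
    on the flat floor of the bathtub curve.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures(n_disks, mtbf_hours, fudge=1.0):
    """Expected failures per year in a batch of n_disks.

    fudge scales the mfr failure rate (by dividing the MTBF), e.g. 2.0 or
    3.0 for the "2x-3x the mfr number" correction discussed in this thread.
    """
    return n_disks * annual_failure_rate(mtbf_hours / fudge)


if __name__ == "__main__":
    mtbf = 1_000_000  # hypothetical quoted MTBF in hours
    print(f"AFR at quoted MTBF: {annual_failure_rate(mtbf):.2%}")
    for fudge in (1.0, 2.0, 3.0):
        n = expected_failures(100, mtbf, fudge)
        print(f"{fudge:.0f}x mfr rate: {n:.2f} failures/year per 100 disks")
```

With these assumed inputs the quoted MTBF works out to a bit under
1%/year, and the 2x-3x correction to somewhere between about 1.7 and 2.6
disks per 100 per year.  None of this captures infant mortality, wear-out,
or a bad batch, which is exactly where the flat-floor model breaks down.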

Even the warranty period and marginal cost are less than perfect
predictors.  I'll bet that in the consumer marketplace they don't
actually have to make good on more than two potential warranty claims
out of three for three year drives -- RMA is a PITA and probably daunts
many a should-be claimant after a 1 year system warranty expires, or
they are sold the systems and never told that the drives have a three
year warranty.  Dropping the warranty on most OTC disks to 1 year sends
a pretty negative signal to me, at least, as does the explicit marginal
cost of adding back the missing two years.  The dollar amounts imply
that the MANUFACTURERS are expecting a whole lot more than 1% of ANY
batch of disks to fail per year, even out there on the bathtub floor.
Who should I believe -- the MTBF or the money?
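
One hedged way to read "the money": if a vendor charges some premium to
extend a warranty from one year to three, and you assume that premium
roughly covers the expected cost of replacements in the extra two years,
you can back out the failure rate being priced in.  The prices and the
break-even assumption below are all hypothetical, and the function name
is made up for this sketch.  A real premium also includes profit and
overhead, so the result is an upper bound on the implied rate.

```python
def implied_annual_failure_rate(premium, replacement_cost, extra_years):
    """Failure rate per year at which the warranty premium breaks even.

    Assumes the premium exactly pays for expected replacements during the
    extra warranty years, with at most one claim per disk and a constant
    failure rate. Profit and overhead in a real premium would lower the
    true implied rate, so treat the result as an upper bound.
    """
    if replacement_cost <= 0 or extra_years <= 0:
        raise ValueError("replacement_cost and extra_years must be positive")
    return premium / (replacement_cost * extra_years)


if __name__ == "__main__":
    # Hypothetical numbers, not taken from any real price list.
    premium = 15.0           # dollars to add two years of warranty
    replacement_cost = 90.0  # dollars for the mfr to replace and ship a disk
    rate = implied_annual_failure_rate(premium, replacement_cost, extra_years=2)
    print(f"Implied failure rate: {rate:.1%} per year")
```

With these made-up inputs the premium breaks even at about 8% per year.
The point is the method, not the number: plug in the actual disk price
and warranty premium you are quoted and compare the result with the rate
the MTBF implies.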

> MTBF for just about everything is computed the same way, and most gizmos
> have the same bathtub-shaped failure curve.

I'm reminded of a line in a statistics book I once read (I can't
remember which one, alas) in which the author had just done a lengthy
analysis of failure rates and probability and arrived at a
mathematically proven and statistically sound conclusion based on the
observations and premises, and then ended his argument with "but
everybody >>knows<< that things go wrong more often than >>that<<" or
something similar.  His point (I think) was that statistics are lovely
but use your gut and your head as well -- reality check time.  I tend to
think more in terms of warranties and Murphy than in mfr's MTBF,
especially when MTBF is a number with absolutely no financial penalty
attached to it, derived from measurements that (necessarily) are not in
the actual context of most usage.


Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
