[Beowulf] Re: Cooling vs HW replacement

Jim Lux jimlux at earthlink.net
Mon Jan 24 06:58:33 PST 2005

----- Original Message -----
From: "Robert G. Brown" <rgb at phy.duke.edu>
To: "Greg Lindahl" <lindahl at pathscale.com>
Cc: <beowulf at beowulf.org>
Sent: Sunday, January 23, 2005 10:57 PM
Subject: Re: [Beowulf] Re: Cooling vs HW replacement

> On Sun, 23 Jan 2005, Greg Lindahl wrote:
> I think >>everybody<< finds that actual failure rates are (at least)
> 2x-3x the mfr number, and finds that it varies wildly in time and with
> environmental conditions and with plain old luck.  That's why (and what
> I mean by stating that) mfr MTBF quotations are optimistic and cheery.

Such may be the case, but as Greg pointed out, they are a standard measure
of reliability of a device.

Actually, I'd trust the MTBF and other reliability data more than the
warranty, and here's why:

1) Warranty terms are economics and marketing driven.  They're set based
(partially) on what the manufacturer thinks is a reasonable expenditure for
warranty replacements.  If Brand A offers a 2 year warranty and Brand B
offers a 3 year warranty, on the same drive, at the same price, people will
buy Brand B, improving Brand B's short term revenue (potentially at some
downstream cost a couple years from now).    The "lemon" phenomenon is
probably more model based than serial number based, given the consistency of
modern manufacturing processes. (ISO 9000 compliance means that you'll
produce the same icky piece of hardware in exactly the same way each time)

2) A warranty is a mere contractual detail, subject to negotiation between
vendor and customer. Of course, we, as retail customers, tend not to have
much negotiating power here, but I'd imagine that the sales agreement
between, for instance, Dell and Seagate, has very different warranty terms,
even if the drives are identical.

3) Most people never collect on warranties, even if the equipment fails. The
sellers are well aware of this. Otherwise, why would there be "lifetime"
warranties on things that clearly will fail or become useless eventually?
"Extended service contracts" are a huge money maker for just this reason.

4) If a mfr sells a product that, for some reason, has problems, they don't
usually adjust the warranty. If it's an expensive item (like a car), they may
have a perverse incentive not to acknowledge that the problem exists, because
doing so would trigger a flood of warranty requests from purchasers whose
units haven't failed yet.  This sometimes results in huge class-action
lawsuits and things like lemon laws. Google for "1.8T Passat Sludge".

5) An MTBF specification is a testable, verifiable number.  If I put out a
procurement for disk drives, and I require an MTBF of, say, 1,000,000 hours
in a particular environment, the vendor has to meet that requirement, and
demonstrate that it has done so in some way (in this case: by a combination
of "similarity", "analysis", and "test", since they obviously couldn't test
the delivered article to death).  At some point, the vendor is going to sign
a piece of paper that says that "this shipment meets all requirements as
specified in ...".
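
To put rough numbers on what a 1,000,000 hour MTBF requirement means, here is
a minimal sketch. It assumes a constant failure rate (exponential lifetimes,
i.e. the flat bottom of the bathtub curve) and drives powered on 24x7; the
function names are mine, purely for illustration, and a real reliability
assessment would use the vendor's actual model and environment.

```python
import math

HOURS_PER_YEAR = 8760  # 24 * 365, assumes continuous operation


def annualized_failure_rate(mtbf_hours):
    """Probability that one drive fails within a year of continuous use,
    assuming a constant failure rate (exponential lifetimes)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def expected_failures_per_year(n_drives, mtbf_hours):
    """Expected failures per year in a fleet where failed drives are
    replaced, under the same constant-failure-rate assumption."""
    return n_drives * HOURS_PER_YEAR / mtbf_hours


def zero_failure_test_hours(mtbf_hours, confidence):
    """Total failure-free drive-hours needed to demonstrate the MTBF at the
    given one-sided confidence level (exponential assumption)."""
    return -mtbf_hours * math.log(1.0 - confidence)


if __name__ == "__main__":
    mtbf = 1_000_000
    print(f"AFR: {annualized_failure_rate(mtbf):.4%}")
    print(f"Failures/year in 1000 drives: {expected_failures_per_year(1000, mtbf):.2f}")
    print(f"Drive-hours for 90% demo: {zero_failure_test_hours(mtbf, 0.90):,.0f}")
```

Under those assumptions a 1,000,000 hour MTBF works out to a bit under 0.9%
per drive per year, or about 9 failures a year in a 1000 drive cluster.
Demonstrating it at 90% confidence takes roughly 2.3 million failure-free
drive-hours, which is why the vendor tests many units in parallel and leans on
"similarity" and "analysis" rather than testing the delivered article to death.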

> As was pointed out by Karen (and I agree) the mfr warranty period is
> perhaps a better number for most people to pay attention to than MTBF as
> it is the only number that actually costs mfrs money when a disk
> "prematurely" fails and the only number that does you any good if you
> buy a hundred disks -- or even just one -- that turn out to be from a
> "bad batch".  Being a cynic, I cannot keep from thinking of the dozens
> of ways an overgood MTBF number could be "cooked" by a mfr, the near
> certainty that nobody will ever do anything like a study that could
> refute it if they pulled it out of thin air, and the lack of financial
> incentives to make it pessimistic or even accurate.

The financial incentive is that if they deliver a product stated to meet a
particular MTBF spec, and they don't, they are committing a fraud, which has
substantial penalties (not just financial) associated with it.

Putting an unrealistic warranty on something is mere marketing, and the only
penalty is a possible financial one, the size of which is determined by many
things other than failure rates, and further, which is far into the future,
long after this year's (quarter's) bonuses related to shareholder return and
revenue have been distributed.

> Maybe they are all
> perfectly honest and drive failure rates are really just 1%/year or
> thereabouts (on the bathtub floor) and I just never noticed it, or was
> unlucky, or beat the disks to death by using them in actual computers
> that only rarely used the disks at all;-) With a warranty, though,
> while I still care I care less -- I still have to hassle with the
> replacement but I don't have to buy the disk over again, even if it is
> just one drive in 100 in a year.

If you really are concerned about failure rates, then get a reliability
engineer to look at their data and make a "real life" assessment. Properly
evaluating the data and interpreting it is non-trivial.
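
As one small illustration of why the interpretation is non-trivial, the sketch
below computes an exact two-sided Poisson confidence interval on an observed
failure rate. It assumes failures arrive as a Poisson process at a constant
rate; the function names are mine, it uses only the standard library, and it
is intended for small failure counts (the simple sum underflows for counts in
the high hundreds).

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total


def _bisect_decreasing(f, lo, hi, iters=200):
    """Find the root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def failure_rate_interval(failures, drive_years, confidence=0.90):
    """Two-sided interval on failures per drive-year, given an observed
    failure count over a number of drive-years of operation."""
    alpha = 1.0 - confidence
    search_max = failures + 10 * math.sqrt(failures + 1) + 10
    # Upper bound: the mean at which seeing this few failures has prob alpha/2.
    upper = _bisect_decreasing(
        lambda lam: poisson_cdf(failures, lam) - alpha / 2, 0.0, search_max)
    if failures == 0:
        lower = 0.0
    else:
        # Lower bound: the mean at which seeing this many has prob alpha/2.
        lower = _bisect_decreasing(
            lambda lam: poisson_cdf(failures - 1, lam) - (1 - alpha / 2),
            0.0, search_max)
    return lower / drive_years, upper / drive_years


if __name__ == "__main__":
    lo, hi = failure_rate_interval(failures=3, drive_years=100)
    print(f"90% interval: {lo:.2%} to {hi:.2%} per drive-year")
```

With 3 failures in 100 drive-years the 90% interval runs from roughly 0.8% to
7.8% per year, so a sample of that size can't distinguish a true 1%/year from
3%/year. Anecdotal counts from one cluster say less than they appear to.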

Especially if you're buying hundreds or thousands of disks, you should be
putting hard requirements in your procurement spec for reliability.  You
could even offer to buy them with NO warranty at a discount, since the
majority of the cost of a failure falls on you anyway (time and hassle).
This is definitely where buying your hardware at the corner computer store
(or at a "big-box" store) isn't a good thing.  These retailers don't have
the skill set, nor the incentive, to properly assess what are fairly arcane

Jim Lux
