[Beowulf] Re: Cooling vs HW replacement

Fri Jan 21 14:58:47 PST 2005

At 11:09 AM 1/21/2005, David Mathog wrote:

> >
> > > or "Server" grade disks still cost a lot more than that.  For
> >
> > this is a very traditional, glass-house outlook.  it's the same one
> > that justifies a "server" at $50K being qualitatively different
> > from a commodity 1U dual at $5K.  there's no question that there
> > are differences - the only question is whether the price justifies
> > those differences.
>
>The MTBF rates quoted by the manufacturers are one indicator
>of disk reliability, but from a practical point of view the number
>of years of warranty coverage on the disk is a more useful metric.
>
>The manufacturer has an incentive to be sure that those disks
>with a 5 year warranty really will last 5 years.  Unclear
>to me what their incentive is to support the MTBF rates since only
>a sustained and careful testing regimen over many, many disks could
>challenge the manufacturer's figures.  And who would run such
>an analysis???  Buy the 5 year disk and you'll have a working
>disk, or a replacement for it, for 5 years.

MTBFs for a disk may seem unrealistic (as was pointed out, nobody is 
likely to run a single disk for 100+ years), but they are a "common 
currency" in the reliability calculation world, as are "Failures in Time" 
(FIT), the number of failures in a billion (1E9) hours of operation.
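
As a rough sketch of how the two "currencies" relate, assuming a constant 
failure rate (so that FIT is simply 1E9 divided by the MTBF in hours): the 
function names here are illustrative, and the 1E6 hour figure is only an 
example value.

```python
HOURS_PER_BILLION = 1e9   # FIT counts failures per 1E9 device-hours
HOURS_PER_YEAR = 8766     # average year, including leap years


def mtbf_to_fit(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to FIT, assuming a constant failure rate."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_mtbf(fit: float) -> float:
    """Convert FIT back to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


mtbf_hours = 1e6  # example value only
print(mtbf_to_fit(mtbf_hours))                  # prints 1000.0
print(f"{mtbf_hours / HOURS_PER_YEAR:.0f} yr")  # roughly 114 years
```

An MTBF of 1E6 hours is the same statement as 1000 FIT, and it is also 
where the "100+ years" for a single disk comes from.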

What would be very useful (and is something that does get analyzed for some 
customers, who care greatly about this stuff) is to compare the MTBF of a 
widget determined by calculation and analysis (look up the component 
reliabilities, calculate the probability of failure for the ensemble) with 
the MTBF of the same widget determined by test (run 1000 disk drives for 
months).  This is especially so if you run what are called "accelerated 
life tests" at elevated temperatures or higher duty factors.
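
A minimal sketch of the "by test" side of that comparison, using the 
simplest point estimate (total unit-hours divided by observed failures). 
The drive count, test duration, failure count and predicted MTBF below are 
all made-up illustration values, and a real analysis would also put 
confidence bounds on the estimate.

```python
def observed_mtbf(units: int, hours_per_unit: float, failures: int) -> float:
    """Point estimate of MTBF from a test: total unit-hours / failures."""
    if failures <= 0:
        raise ValueError("need at least one failure for a point estimate")
    return units * hours_per_unit / failures


# Hypothetical test: 1000 drives, 2000 hours each, 4 failures observed.
tested = observed_mtbf(units=1000, hours_per_unit=2000.0, failures=4)

# Hypothetical figure from calculation and analysis of the components.
predicted = 1e6

print(f"tested MTBF:    {tested:.0f} hours")
print(f"predicted MTBF: {predicted:.0f} hours")
print(f"tested / predicted = {tested / predicted:.2f}")
```

With these invented numbers the tested MTBF comes out at half the 
predicted one, which is the kind of gap the comparison is meant to expose.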

MTBFs are also used because they're easier to understand and handle than 
things like "reliability", which winds up being .999999999, or failure 
rates per unit time, which wind up being very tiny numbers (unless "unit 
time" is a billion hours).
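
To illustrate why the other figures are awkward to handle, here is a small 
sketch that turns an MTBF into a reliability and a per-hour failure rate. 
It assumes the usual constant failure rate (exponential) model, 
R(t) = exp(-t / MTBF), which is an assumption and not something the disk 
vendors necessarily use.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of surviving t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def failure_rate_per_hour(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) implied by an MTBF."""
    return 1.0 / mtbf_hours


mtbf_hours = 1e6  # example value only
print(f"{reliability(1.0, mtbf_hours):.9f}")   # a string of nines
print(f"{failure_rate_per_hour(mtbf_hours):.1e} failures/hour")
```

Both outputs describe the same device as "MTBF = 1E6 hours", but the MTBF 
is much easier to read and compare.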

And, if I were asked to estimate the reliability of a PC, I'd want to get 
the MTBF numbers for all the assemblies, and then I could calculate a 
composite MTBF, which might be surprisingly short.  If I then had to 
calculate the time between PC failures in a cluster of 1000 computers, it 
would be appallingly short.
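
A sketch of the composite calculation, assuming the PC fails when any one 
assembly fails and that each assembly has a constant failure rate, so the 
rates (1/MTBF) simply add. The assembly names and MTBF values are invented 
for illustration, not vendor data.

```python
def composite_mtbf(mtbfs_hours):
    """MTBF of a series system: failure rates (1/MTBF) add up."""
    total_rate = sum(1.0 / m for m in mtbfs_hours)
    return 1.0 / total_rate


# Hypothetical per-assembly MTBFs in hours.
assemblies = {
    "disk": 1e6,
    "motherboard": 2e5,
    "power supply": 1e5,
    "fan": 5e4,
}

pc_mtbf = composite_mtbf(assemblies.values())
print(f"composite PC MTBF: {pc_mtbf:.0f} hours")  # about 27,800 hours
```

The composite is always shorter than the weakest assembly: here a 1E6 hour 
disk sits inside a PC whose MTBF is under 3E4 hours.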

To a first order, an ensemble of 1000 units, each with an MTBF of 1E6 hours, 
will have an MTBF of only 1000 hours, which isn't all that long.  And if 
the MTBF of those units is only 1E5 hours, because you're running them 25 
degrees hotter than expected, the ensemble MTBF drops to 100 hours, so only 
a few days will go by before you get your first failure.
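
The first-order arithmetic can be sketched as follows, assuming identical, 
independent units with constant failure rates, so that the ensemble MTBF is 
the unit MTBF divided by the number of units. The function name is 
illustrative.

```python
def ensemble_mtbf(unit_mtbf_hours: float, n_units: int) -> float:
    """First-order MTBF of n identical, independent units."""
    return unit_mtbf_hours / n_units


for unit_mtbf in (1e6, 1e5):
    hours = ensemble_mtbf(unit_mtbf, 1000)
    print(f"unit MTBF {unit_mtbf:.0e} h -> ensemble {hours:.0f} h "
          f"({hours / 24:.1f} days)")
```

For 1000 units this gives 1000 hours (about 42 days) at a unit MTBF of 1E6 
hours, and 100 hours (about 4 days) at 1E5 hours.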

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875