[Beowulf] Re: Cooling vs HW replacement

Karen Shaeffer shaeffer at neuralscape.com
Thu Jan 27 00:48:44 PST 2005

On Sun, Jan 23, 2005 at 11:14:14PM -0800, Greg Lindahl wrote:
> On Mon, Jan 24, 2005 at 01:57:16AM -0500, Robert G. Brown wrote:
> > Otherwise, what I was basically doing is describing the bathtub
> Didn't look like that to me, but I just read your rants, I wasn't the
> guy who wrote them.
> > As was pointed out by Karen (and I agree) the mfr warranty period is
> > perhaps a better number for most people to pay attention to than MTBF
> I disagree. The warranty period tells you about disk lifetime. The
> MTBF tells you about the failure rate in the bottom of the
> bathtub. These are nearly independent quantities; I already pointed
> out that the fraction of disks which fail in the bottom of the bathtub
> is small, even if you multiply it by 2X or 3X. So the major factor in
> the price and length of that warranty is the lifetime.
> Lifetime and MTBF are simply different measures. Depending on what you
> are thinking about, you pay attention to one or the other or neither.

These numbers are defined by their collective usage in the industry. I
accept your assertion about their definitions. But the MTBF number has
no consequential significance to a disk drive manufacturer, and thus
has poor confidence associated with it -- and I am going to explain why.

As stated previously, the disk drive business is an extremely high
volume, low margin, technology intensive business. Product cycles last
about 6 months. A typical disk drive comes out of development and
ramps up from zero units to several million units within about 6 weeks.
This is an operational miracle in and of itself, but it is standard business
in this industry.

Now, I assert the warranty period and the integration of the failure rate
during the six month production run are the only issues the disk drive
manufacturer (DDM) cares about. During this period, as production is in
progress, the failure rate is dominated by the infant mortality of recently
sold drives from this production run. Even the first batch of production
drives sold is only 6 months into its lifecycle at the end of the
production run. (Let's define the infant mortality time window to be a
weighted 6 week period. This is the left wall of the bathtub.) The rate of
failure, from the drives that make it through the infant mortality time
period, is a very small component compared to the drives that are failing
during infant mortality. In other words, with several million drives a month
being sold, this infant mortality of recently sold drives is the dominating
term in the equation during the life of the production run.
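
The comparison above can be sketched as a small cohort model. This is only
an illustration: the function name and every numeric input (weekly volume,
1% infant mortality, 500,000 hour MTBF) are placeholder assumptions, not
figures from any DDM. It assumes infant deaths are spread evenly over the
6 week window and that survivors fail at a constant rate of 1/MTBF per hour.

```python
HOURS_PER_WEEK = 168

def production_run_failures(weekly_units, run_weeks, infant_fraction,
                            infant_weeks, mtbf_hours):
    """Expected failures by the end of the run, split into (infant, midlife)."""
    infant = 0.0
    midlife = 0.0
    for ship_week in range(run_weeks):
        age = run_weeks - ship_week  # weeks in service at the end of the run
        # Infant deaths: assumed spread evenly over the infant window.
        infant += weekly_units * infant_fraction * min(age, infant_weeks) / infant_weeks
        # Midlife deaths: linear approximation of a constant 1/MTBF hazard,
        # applied only to time spent past the infant window.
        survivors = weekly_units * (1 - infant_fraction)
        midlife_weeks = max(0, age - infant_weeks)
        midlife += survivors * midlife_weeks * HOURS_PER_WEEK / mtbf_hours
    return infant, midlife

# Placeholder inputs, chosen only to show the structure of the comparison.
infant, midlife = production_run_failures(
    weekly_units=500_000, run_weeks=26, infant_fraction=0.01,
    infant_weeks=6, mtbf_hours=500_000)
print(f"infant: {infant:,.0f}  midlife: {midlife:,.0f}")
```

With these placeholder inputs the infant term comes out roughly three times
the midlife term. Different inputs can change that ratio, which is why the
numbers should be treated as a sketch and not as evidence.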

Now, failed drives are classified in numerous ways by the DDM, but the most
important issue is how long each one lived in production. Was it an infant
mortality death? If it was, then the DDM is very interested in it. If
it survived the infant mortality failure time window, then the DDM has very
little (if any) interest in it during the production run. (Someone asserted
the DDM will take failed drives and determine the failure mode. This is only
true of failures during infant mortality. Drives that fail in the bottom
of the bathtub are generally thrown away and simply replaced. You would need
to exceed the standard deviation for the MTBF significantly, before the
DDM would start analyzing the failure modes in this case. On the other hand,
any perceived aberration in the expected infant mortality rate would start a
fire drill.)
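
One way to picture what an "aberration" means here is a simple z-score
on the infant failure count. This is a minimal sketch under a binomial
assumption, not a description of any DDM's actual process. The function
name, the 1% expected rate, and the cutoff of 3 standard deviations are all
hypothetical.

```python
import math

def infant_failure_z_score(shipped, observed_failures, expected_rate):
    """Standard deviations between observed and expected infant failures."""
    expected = shipped * expected_rate
    # Binomial standard deviation for `shipped` independent drives.
    std_dev = math.sqrt(shipped * expected_rate * (1 - expected_rate))
    return (observed_failures - expected) / std_dev

# Hypothetical batch: 100,000 drives, 1% expected infant mortality.
z = infant_failure_z_score(shipped=100_000, observed_failures=1_100,
                           expected_rate=0.01)
ALARM_THRESHOLD = 3.0  # assumed cutoff for starting a "fire drill"
print(f"z = {z:.2f}, alarm = {z > ALARM_THRESHOLD}")
```

The larger the batch, the smaller the relative excess needed to cross the
threshold, which fits the idea that high volume makes early detection
possible.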

The point is, as in ALL mass produced products, and especially in
semiconductors and disk drives, early detection of statistically significant
failure modes is ABSOLUTELY ESSENTIAL to the profitability of the firm. If
you have a production problem and don't figure it out until a million drives
are out there in the market, then you have just lost a huge amount of money.
I'm talking about a whole quarter's profits or worse. If you have a role in
such a disaster -- then your career in the DDM industry just ended, which is
why everyone is keenly focused on the issues that matter.

In summary, for a specific disk drive, the bulk of the failed drives that
the DDM will replace based on the warranty will have in fact failed during
their infant mortality window immediately after being put into production.
If you integrate these over the product's production and then calculate the
drives that fail based on MTBF numbers associated with the bottom of the
bathtub, you will find this is a minor term in total number of failed drives
during the warranted time. (This is clearly true in the normal case.)

The DDM is entirely focused on these infant mortality failures, because they
provide early warning for preventing large scale problems. (This includes
the case where actual midlife failure rates would far exceed the projections
one could expect based on MTBF. This is an essential element in this
discussion. Please keep it in mind.) The scrutiny of these infant mortality
failures is intense. Any aberration from the expected numbers causes the whole
production and engineering teams associated with the particular disk drive
product to become available resources to resolve the problem. The time to
resolve problems is counted in hours. It's that intense. Do the math.
Several million units produced per month translates to thousands of drives
produced per hour (3 million units over a 730-hour month is about 4,100 per
hour). If you have a problem, it can become a disaster quickly.
(Note the DDM actually puts several thousand drives in a QA lab about 6
weeks prior to shipping the first drive. So they actually have the initial
statistical results of the infant mortality before shipping any drives.)
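
The per-hour production rate is easy to check for any monthly volume. A
minimal sketch, assuming round-the-clock production and a 730 hour month
(both are assumptions, as are the function name and the sample volumes):

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12, assuming 24x7 production

def drives_per_hour(units_per_month, hours_per_month=HOURS_PER_MONTH):
    """Average drives produced per hour for a given monthly volume."""
    return units_per_month / hours_per_month

# Sample monthly volumes, for illustration only.
for units in (3_000_000, 5_000_000, 20_000_000):
    print(f"{units:>12,} units/month -> {drives_per_hour(units):>8,.0f} drives/hour")
```

Every hour a production problem goes unnoticed adds that many suspect
drives to the channel.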

Now, once production ends, then the infant mortality deaths drop off as soon
as the supply of drives in the channel is sold out. At this point, there is
nothing the DDM can do. The drives are out there. Whether or not the total
number of warranted drives that fail will have exceeded expectations is
already known in almost all cases.

All resources are now turned to the next product release. Nobody at the DDM
gives a hoot about the MTBF numbers. They are not even discussed within the
internal workings of the DDM business.

The POINT IS, even if the MTBF numbers are not accurate, and the drives
fail at a much higher rate, there is nothing the DDM can actually do about
it. Production is over. It is this reality that relegates these numbers to
be nothing more than window dressing on marketing literature. And there are
numerous products where the failure rate implied by the published MTBF wildly
understated the actual midlife failure rate -- where the DDM took a big loss. But the
reaction, after the fact, would all be focused on why the early detection
and component QA processes failed. It would not even consider how the MTBF
numbers were derived. Because they need to catch the problem early or it
is not helpful.

So, now that we know what interests the DDM with respect to failed drives,
let's consider the MTBF numbers that are published. As Greg and others have
pointed out, these numbers speak of the rate of failure at the bottom of
the bathtub. The definition is not in question here. The question is the
confidence you can place on those published numbers.
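
For reference, a published MTBF can be turned into an expected failure
fraction if you assume a constant failure rate of 1/MTBF at the bottom of
the bathtub (the usual exponential model). The sketch below does that. The
function names, the 1,000,000 hour MTBF, and the 3 year period are
illustrative assumptions, not figures from any datasheet.

```python
import math

HOURS_PER_YEAR = 8760

def fraction_failed(mtbf_hours, service_hours):
    """Expected fraction failed after service_hours at a constant 1/MTBF hazard."""
    return 1 - math.exp(-service_hours / mtbf_hours)

def annualized_failure_rate(mtbf_hours):
    """Expected fraction failing in one year of continuous operation."""
    return fraction_failed(mtbf_hours, HOURS_PER_YEAR)

mtbf = 1_000_000  # hypothetical published MTBF, in hours
print(f"per year:     {annualized_failure_rate(mtbf):.2%}")
print(f"over 3 years: {fraction_failed(mtbf, 3 * HOURS_PER_YEAR):.2%}")
```

This says nothing about infant mortality or wear-out. It only describes the
flat part of the curve, and only if the published number is right.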

I and others have asserted you cannot place much confidence in these
numbers, because they have no financial consequence to the DDM. (Except
of course if they are wildly wrong -- which brings with it the particular
problem of being too late to do anything about it.)  I have explained why
this is so. I have also explained how the DDM assigns all its resources to
the critical problems: the rate of production is so high that time is of the
essence in protecting profits. Once production ends, all resources are
reassigned to the next product to be released.

It is my understanding that these MTBF numbers are derived from thermal
cycling in ovens as part of the QA process. All the likely failure modes
in a disk drive are quite sensitive to thermal conditions. These are the
media, the heads, the spindle, bearings, lubricants, etc., comprising the
critical mechanical structure, the temperature dependence of band gaps and
other calibrating circuitry within the electronics, nominal currents within
the microelectronics and especially the power MOSFET arrays, the servo
system calibration, etc. As the thermal cycling QA processes proceed,
defects in these systems can be forced to manifest during the testing, and
the normal state characteristics and stability of these subsystems can also
be extracted from the experiments. These results are then rigorously
integrated within the observed profiles and characteristics of drives
failing within the infant mortality window. It is all highly integrated
within statistical models for expectations. MTBF numbers are also
extrapolated from the results. In effect, the MTBF numbers become the long
term projections that are extrapolated from this data. But the primary
focus and optimization of processes is intended to create the statistical
underpinning from which to analyze infant mortality drive failures. The
uncertainty in these numbers naturally increases for the MTBF projections.

It's all perfectly logical.
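
To give a rough feel for why the uncertainty grows, here is a simplified
stand-in for an MTBF estimate from a short test: total device-hours divided
by observed failures, with the usual Poisson rule of thumb that the relative
error is about 1/sqrt(failures). It ignores thermal acceleration entirely.
The post mentions several thousand drives tested for about 6 weeks; the
5,000 drives and 5 failures below are hypothetical, as is the function name.

```python
import math

HOURS_PER_WEEK = 168

def mtbf_point_estimate(drives, test_hours, failures):
    """Return (estimated MTBF in hours, approximate relative error)."""
    device_hours = drives * test_hours
    mtbf = device_hours / failures
    # Poisson counting error: relative uncertainty is about 1/sqrt(count).
    relative_error = 1 / math.sqrt(failures)
    return mtbf, relative_error

# Hypothetical test: 5,000 drives for 6 weeks with 5 midlife-type failures.
mtbf, rel = mtbf_point_estimate(drives=5_000, test_hours=6 * HOURS_PER_WEEK,
                                failures=5)
print(f"MTBF ~ {mtbf:,.0f} hours, +/- about {rel:.0%}")
```

A handful of failures supports only a very loose estimate, and a long term
projection built on it inherits that looseness.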

> Yes. And I have yet to see anything in your complaint that is anything
> but misinterpretation on your part. Reality check time, indeed. You
> can't use MTBF by itself as a measure of quality, period, so complaining
> that it isn't a good single item to measure disk quality is, well,
> operator error.
> -- greg

I think the problem with the logic you and others have embraced is that
it is not well correlated with the operational priorities of the DDM
industry. As with all industries, competitors publish normalized metrics
for customers to compare. (And of course, they want you to think these
metrics are really important! They want you to buy their product.) I
believe the MTBF number is more of a marketing number than something the
DDM goes to great lengths to formulate. On the other hand, in a well
executed production run, where everything goes as planned, the MTBF
numbers are likely to be accurate. After all, MTBF and infant mortality
rates clearly share dependencies in the normal case -- and in fact, the
MTBF numbers are derived from the processes optimized for anticipating
the expected infant mortality rate during the production run.

If DDMs were interested in helping customers discriminate based on the
actual expected lifetime of drives, they would all publish running infant
mortality rates, updated weekly, during the production run of their disk
drives. After all, this is the one metric the entire organization is focused
on during production. But, what they hand out is this MTBF number to
prospective customers. A number they pay no attention to internally.

 Karen Shaeffer
 Neuralscape, Palo Alto, Ca. 94306
 shaeffer at neuralscape.com  http://www.neuralscape.com
