[Beowulf] Are disk MTBF ratings at all useful?
Lux, Jim (337C)
james.p.lux at jpl.nasa.gov
Mon Apr 22 21:39:27 PDT 2013
From: "Peter St. John" <peter.st.john at gmail.com>
Date: Monday, April 22, 2013 6:19 PM
To: mathog <mathog at caltech.edu>
Cc: "beowulf at beowulf.org" <beowulf at beowulf.org>
Subject: Re: [Beowulf] Are disk MTBF ratings at all useful?
Human mortality has, broadly, a Poisson, and a non-Poisson, component. The chance of getting hit by a meteor is Poisson, it has nothing to do with your age; but the chance of a 99 year old living to 100 is lower than the chance of a 20 year old living to 21, because we wear out, that's not Poisson. (Dogs are a clearer example: the chance of getting hit by a car is Poisson, but dying of old age after a dozen years or so is not.)
We usually think of incandescent light bulbs as Poisson; the chance of, I don't know, Brownian Motion, clipping a very narrow filament, is bigger than the degradation of mere use; except in the case of switching the bulb off and on frequently, when the chance of failure depends more on fatigue as the filament expands and contracts.
Hard Disks are somewhat Poisson, and somewhat not. More so, I think, than humans.
What you are describing is the standard bathtub curve, where the failure rate is constant on the "bottom" of the bathtub. Infant mortality isn't an issue any more, and old age/wearout hasn't started.
I would say that the real question is "where is the far side of the bathtub", where the rate starts climbing steeply. That's the important number, and one that is NOT necessarily the MTBF. I suspect the "calculated" MTBF in a system without any big wearout mechanisms would be essentially the inverse of the failure rate in the flat part of the curve. However, electromechanical devices DO have wear-out mechanisms, and they likely have a shorter life than the electronics.
Furthermore, the wear life might be some complicated thing like "integrated head motion" with some very complicated power laws. As an example of a seemingly simple component with a complex life phenomenon, take capacitors used for pulsed power systems. They typically have a life that goes something like
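To put a number on the "MTBF is the inverse of the flat-part failure rate" point, here is a rough sketch. It assumes a purely constant failure rate (no infant mortality, no wearout), and the 1,000,000 hour MTBF is a made-up datasheet-style figure for illustration, not one taken from any particular drive.

```python
import math

def survival_probability(hours, mtbf_hours):
    """P(unit still working after `hours`), assuming a constant failure rate of 1/MTBF."""
    return math.exp(-hours / mtbf_hours)

def annualized_failure_rate(mtbf_hours, hours_per_year=8760.0):
    """Fraction of units expected to fail in one year of continuous operation."""
    return 1.0 - survival_probability(hours_per_year, mtbf_hours)

if __name__ == "__main__":
    mtbf = 1_000_000.0  # hypothetical MTBF in hours (assumption for illustration)
    print(f"annualized failure rate: {annualized_failure_rate(mtbf):.3%}")
    print(f"P(survive 5 years):      {survival_probability(5 * 8760.0, mtbf):.3f}")
```

Under that assumption the MTBF only describes the bottom of the bathtub. It says nothing about when the far wall starts, which is why a million-hour MTBF does not mean a drive lasts a century.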
Lx = Lref * (Qref/Qx)^1.6 * (Vref/Vx)^7.5
The wearout mechanism has to do with internal mechanical stresses. So, increasing the Q of the circuit increases the amount of voltage reversal as the exponentially damped sine wave rings down. And voltage has a very strong effect on life, because it is directly related to the mechanical loads, as well as the electrical stress on the dielectric.
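To see how steep those exponents are, here is a small sketch of the scaling formula above. The function name and the example operating points are just illustrative assumptions; the 1.6 and 7.5 exponents are the "something like" values quoted above, and real parts will differ.

```python
def capacitor_life(l_ref, q_ref, q_x, v_ref, v_x, q_exp=1.6, v_exp=7.5):
    """Scaled life Lx = Lref * (Qref/Qx)^q_exp * (Vref/Vx)^v_exp.

    l_ref is the life at the reference conditions (q_ref, v_ref);
    the result is in the same units as l_ref (shots, hours, ...).
    """
    return l_ref * (q_ref / q_x) ** q_exp * (v_ref / v_x) ** v_exp

if __name__ == "__main__":
    # Hypothetical operating points, relative to a reference life of 1.0
    print("90% of reference voltage:", round(capacitor_life(1.0, 10, 10, 1.0, 0.9), 2))
    print("110% of reference voltage:", round(capacitor_life(1.0, 10, 10, 1.0, 1.1), 2))
    print("twice the reference Q:    ", round(capacitor_life(1.0, 10, 20, 1.0, 1.0), 2))
```

Backing off the voltage by 10% roughly doubles the life, while doubling the circuit Q cuts it to about a third.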
Another common device with a not entirely intuitive life characteristic is an incandescent light bulb. Life goes as (variously) the 12th to 16th power of voltage (higher voltage = shorter life), while light output goes as the 3.4 power of voltage. So you could have a usage pattern that seems equivalent in terms of operating hours, or total lumen-seconds produced, and have very different life.
The same is no doubt true of disk drives. While the Google folks didn't find any big obvious patterns (other than failure rates increasing at low and high temps), they also commented that their sample was non-homogeneous, so you could be looking at the equivalent of 100 Volt, 120 Volt and 130 Volt lightbulbs all running off the same 115V circuit.
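As a sketch of what that mixed-bulb analogy does to the statistics, the snippet below applies the rule-of-thumb exponents above (life as the inverse 12th to 16th power of voltage, light as the 3.4 power) to bulbs of three rated voltages on one 115V circuit. The function names are made up for illustration, and the results are relative to each bulb's own rated life and output.

```python
def relative_life(v_applied, v_rated, exponent=12.0):
    """Life relative to rated life; higher applied voltage means shorter life."""
    return (v_rated / v_applied) ** exponent

def relative_light(v_applied, v_rated, exponent=3.4):
    """Light output relative to rated output."""
    return (v_applied / v_rated) ** exponent

if __name__ == "__main__":
    line_voltage = 115.0
    for rated in (100.0, 120.0, 130.0):
        lo = relative_life(line_voltage, rated, 12.0)
        hi = relative_life(line_voltage, rated, 16.0)
        light = relative_light(line_voltage, rated)
        print(f"{rated:.0f}V bulb on {line_voltage:.0f}V: "
              f"life x{min(lo, hi):.2f} to x{max(lo, hi):.2f}, light x{light:.2f}")
```

Same circuit, same operating hours, and the expected lives differ by more than an order of magnitude between the 100V and 130V bulbs. A pooled failure rate over such a population would hide that entirely.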