[Beowulf] Re: failure trends in a large disk drive population
Jim Lux
James.P.Lux at jpl.nasa.gov
Thu Feb 22 10:40:58 PST 2007
At 08:22 AM 2/22/2007, David Mathog wrote:
>Justin Moore wrote:
> > As mentioned in an earlier e-mail (I think) there were 4 SMART variables
> > whose values were strongly correlated with failure, and another 4-6 that
> > were weakly correlated with failure. However, of all the disks that
> > failed, less than half (around 45%) had ANY of the "strong" signals and
> > another 25% had some of the "weak" signals. This means that over a
> > third of disks that failed gave no appreciable warning. Therefore even
> > combining the variables would give no better than a 70% chance of
> > predicting failure.
>
>Now we need to know exactly how you defined "failed".
The paper defined failed as "requiring the computer to be pulled"
whether or not the disk was actually dead.
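(As a quick sanity check on Justin's numbers above: if you treat the
"strong" and "weak" groups as disjoint fractions of all failed drives,
which is how I read the summary, the 70% ceiling falls straight out.
A few lines of Python, purely illustrative:)

    # Quoted fractions of failed drives showing each class of SMART signal.
    strong = 0.45   # drives with at least one "strong" signal (around 45%)
    weak   = 0.25   # additional drives with only "weak" signals

    warned = strong + weak      # ~0.70, hence "no better than 70%" prediction
    silent = 1.0 - warned       # ~0.30, the drives that gave no warning at all
    print(f"warned: {warned:.0%}  silent: {silent:.0%}")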
>Were there postmortem analyses of the power supplies in the failed
>systems? It wouldn't surprise me if low or noisy power lines led
>to an increased rate of disk failure. SMART wouldn't give this
>information (at least, not on any of the disks I have), but
>lm_sensors would.
I would make the case that it's not worth even glancing at the
outside of a dead unit, much less doing failure analysis on the
power supply.  FA is expensive; new computers are not.  Pitch the
dead (or "not quite dead yet, but suspect") computer, slap in a new
one, and go on.
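(That said, if someone did want to log the rails before pitching the
box, the voltages lm_sensors reads come from the kernel's hwmon sysfs
tree, so a cron job could snapshot them alongside SMART.  A rough
sketch; the hwmon layout is standard, but which inN input maps to
which rail, and whether labels exist at all, is board specific:)

    import glob, os

    # Rough sketch: dump whatever voltage rails the kernel hwmon drivers
    # (the same interface lm_sensors reads) expose under sysfs.
    # inN_input is in millivolts; which input maps to which rail is board
    # specific, and on older kernels the files may sit under <chip>/device/.
    for chip in sorted(glob.glob("/sys/class/hwmon/hwmon*")):
        name = open(os.path.join(chip, "name")).read().strip()
        for inp in sorted(glob.glob(os.path.join(chip, "in*_input"))):
            label_file = inp.replace("_input", "_label")
            label = (open(label_file).read().strip()
                     if os.path.exists(label_file)
                     else os.path.basename(inp))
            millivolts = int(open(inp).read().strip())
            print(f"{name}: {label} = {millivolts / 1000.0:.3f} V")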
There is some non-zero value in understanding the failure mechanics,
but probably only if the failure rate is high enough to make a
difference.  That is, if you had a 50% failure rate, it would be
worth understanding; if you had a 3% failure rate, it might be
better to just replace and move on.
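(To put rough numbers on that: the analysis only pays for itself if
the fixes it enables save more replacements than the analysis costs.
A toy break-even calculation, every figure below invented purely for
illustration:)

    # Toy break-even: is root-cause failure analysis worth it at a given
    # failure rate?  Every figure here is invented for illustration.
    nodes        = 1000
    node_cost    = 2000.0     # cost to swap in a replacement node, dollars
    fa_cost      = 50000.0    # one-time cost of a serious failure analysis
    fix_fraction = 0.5        # fraction of failures the resulting fix prevents

    for failure_rate in (0.50, 0.03):
        failures = nodes * failure_rate
        savings  = failures * fix_fraction * node_cost
        verdict  = "do the FA" if savings > fa_cost else "just replace and move on"
        print(f"{failure_rate:.0%} failure rate: saves ${savings:,.0f} "
              f"vs ${fa_cost:,.0f} of analysis -> {verdict}")

With those made-up numbers the 50% case pays for the analysis many
times over and the 3% case never clears the bar.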
There is also some value in predicting failures, IF there's an
economic benefit from knowing early. Maybe you can replace computers
in batches less expensively than waiting for them to fail, or maybe
you're in a situation where a failure is expensive (highly tuned,
brittle software with no checkpoints that has to run on 1000 processors
in lockstep for days on end). I can see Google being in the former
case but probably not in the latter. Predictive statistics might
also be useful if there is some "common factor" that kills many disks
at once (Gosh, when Bob is the duty SA after midnight and it's the
full moon, the air filters clog with some strange fur and the drives
overheat, but only in machine rooms with a window to the outside...)
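(For the brittle-lockstep case the arithmetic is unforgiving: with
enough nodes that all have to stay up for the whole run, even a small
per-node failure rate makes an uninterrupted run far from a sure
thing.  A quick sketch, assuming independent exponential failures and
invented rates:)

    import math

    # Probability that a checkpoint-free run finishes before any node fails,
    # assuming independent, exponentially distributed node failures.
    nodes            = 1000
    annual_fail_rate = 0.03          # per-node failures per year (invented AFR)
    run_days         = 5.0

    run_years = run_days / 365.0
    p_success = math.exp(-nodes * annual_fail_rate * run_years)
    print(f"P(no node fails during the run) = {p_success:.1%}")   # roughly 66%

With those rates about a third of the long runs get killed by a node
failure, which is when predicting (or checkpointing) starts to pay.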
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875