[Beowulf] Re: failure trends in a large disk drive population

Jim Lux James.P.Lux at jpl.nasa.gov
Thu Feb 22 17:12:10 PST 2007

At 12:30 PM 2/22/2007, David Mathog wrote:
>Jim Lux wrote:
> > >Now we need to know exactly how you defined "failed".
> >
> > The paper defined failed as "requiring the computer to be pulled"
> > whether or not the disk was actually dead.
>That was sort of my point, if you're looking for indicators that
>lead to "failed disk" there should be a precise definition of
>what "failed disk" is.  How am I to know what criteria Google uses
>for classifying a machine as nonfunctioning?    If the system is
>pulled because the CPU blew up it's one thing, but if they pulled it
>for any disk related reason, we need to know how bad "bad" was.

True. There's a paragraph or so on how they determined "failed"
(e.g., they didn't include drives removed from service because of
scheduled replacement).

> > I would make the case that it's not worth it to even glance at the
> > outside of the case of a dead unit, much less do failure analysis on
> > the power supply.  FA is expensive, new computers are not.  Pitch the
> > dead (or "not quite dead yet, but suspect") computer, slap in a new
> > one and go on.
>Well, they cared enough to do the study!

Or, more realistically, the dollars spent on the study to identify a
possible connection were small enough that they're probably down in
the overall budgetary noise floor.

>I think the heart of the problem is that disk failures are a bit like
>airplane crashes: everything looks great until something snaps and then
>the plane goes down shortly thereafter.

I think one of the values of the study was that it actually did
demonstrate just that: you really can't do a very good job of
predicting failures in advance, so you'd better have a system in
place to deal with the inevitable failures of drives while they're
in service.

And, of course, they have some "real numbers" on failure rates, which 
is useful in and of itself, regardless of whether the failures could 
be predicted.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875 
