[Beowulf] Re: failure trends in a large disk drive population

David Mathog mathog at caltech.edu
Thu Feb 22 12:30:21 PST 2007


Jim Lux wrote:

> >Now we need to know exactly how you defined "failed".
> 
> The paper defined failed as "requiring the computer to be pulled" 
> whether or not the disk was actually dead.

That was sort of my point: if you're looking for indicators that
lead to "failed disk", there should be a precise definition of
what "failed disk" is.  How am I to know what criteria Google uses
for classifying a machine as nonfunctioning?  If the system is
pulled because the CPU blew up it's one thing, but if they pulled it
for any disk-related reason, we need to know how bad "bad" was.


> I would make the case that it's not worth it to even glance at the 
> outside of the case of a dead unit, much less do failure analysis on 
> the power supply.  FA is expensive, new computers are not.  Pitch the 
> dead (or "not quite dead yet, but suspect") computer, slap in a new 
> one and go on.

Well, they cared enough to do the study!

I think the heart of the problem is that disk failures are a bit like
airplane crashes: everything looks great until something snaps and then
the plane goes down shortly thereafter.  Similarly, there's just
not that much time between the cause of the failure manifesting
itself and the final disk failure.  Once the disk heads start
bouncing off the disk, or some piece of dirt or metal shaving
gets between the disks and the heads, it's all over pretty quickly.
Until that point there may be a few weak indications that something
is wrong, but they may or may not have a relation to the final
failure event.  For instance, a tiny bit of junk stuck to the
surface may cause a few blocks to remap and never do anything else.
It might or might not mean that a huge chunk of the same stuff is
about to wreak havoc.  (Its absence is clearly preferred though, since
any remapped blocks can result in data loss.)

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


