[Beowulf] Re: failure trends in a large disk drive population

David Mathog mathog at caltech.edu
Thu Feb 22 08:22:34 PST 2007

Justin Moore wrote:
> As mentioned in an earlier e-mail (I think) there were 4 SMART variables 
> whose values were strongly correlated with failure, and another 4-6 that 
> were weakly correlated with failure.  However, of all the disks that 
> failed, less than half (around 45%) had ANY of the "strong" signals and 
> another 25% had some of the "weak" signals.  This means that over a 
> third of disks that failed gave no appreciable warning.  Therefore even 
> combining the variables would give no better than a 70% chance of 
> predicting failure.

Now we need to know exactly how you defined "failed".  Presumably,
AFTER you have determined that a disk has failed, various SMART
parameters have very high values.  As you say, before failure there
are SMART indicators but no clear trend.  What separates one set
of SMART values (indicator) from the other (failed)?

Is it possible that more frequent monitoring of SMART variables
could catch the early failure (chest pains, so to speak) before
the total failure (fatal heart failure)?  This might give a few
more seconds or minutes of warning before disk failure, possibly
enough time for a node to indicate it is about to fail and shut down,
especially if it can do so without writing much to the disk.
Admittedly, this would not be nearly as useful as knowing that
a disk will fail in a week!
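
As a rough illustration of the frequent-monitoring idea, the sketch
below compares two snapshots of a SMART attribute table and reports
counters that rose between polls.  It assumes the text comes from
something like `smartctl -A` on the device, with the usual ten-column
attribute layout.  The function names and the two attribute names in
the sample are illustrative assumptions; they are not the variables
identified in the study.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from attribute-table text.

    Assumes the ten-column layout printed by `smartctl -A`:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def rising_counters(previous, current, watched):
    """Return {name: (old, new)} for watched counters that increased."""
    return {
        name: (previous[name], current[name])
        for name in watched
        if name in previous
        and name in current
        and current[name] > previous[name]
    }


# Made-up sample snapshots, for illustration only.
HEADER = ("ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
          "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n")
EARLIER = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    "
    "Pre-fail  Always       -       0\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    "
    "Old_age   Always       -       0\n"
)
LATER = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   036    "
    "Pre-fail  Always       -       8\n"
    "197 Current_Pending_Sector  0x0012   100   100   000    "
    "Old_age   Always       -       0\n"
)

# Hypothetical watch list; choose attributes to suit the drives at hand.
WATCHED = ["Reallocated_Sector_Ct", "Current_Pending_Sector"]

if __name__ == "__main__":
    changed = rising_counters(
        parse_smart_attributes(EARLIER),
        parse_smart_attributes(LATER),
        WATCHED,
    )
    print(changed)
```

A node could run a comparison like this every few seconds and raise
an alarm on any increase, without writing to the local disk.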

Disks that just stop spinning or won't spin back up (motor/spindle
failure) are another problem that presumably cannot be detected by
SMART.  However, this mode of failure is usually only seen in DOA disks
and old, old disks.  What fraction of the failed disks failed
this way?

Were there postmortem analyses of the power supplies in the failed
systems?  It wouldn't surprise me if low or noisy power lines led
to an increased rate of disk failure.  SMART wouldn't give this
information (at least, not on any of the disks I have), but
lm_sensors would.
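
A minimal sketch of a supply-voltage check, assuming a Linux node
whose sensor driver exposes voltages through the kernel hwmon sysfs
files (`in*_input`, in millivolts), which is where lm_sensors gets
its readings.  The function names and the 5% tolerance are
assumptions made for illustration.  Raw sysfs values may also need
board-specific scaling before they correspond to the actual rail
voltages, so treat the numbers with care.

```python
import glob
import os


def read_hwmon_voltages(root="/sys/class/hwmon"):
    """Return {label_or_path: volts} for every in*_input file under root."""
    readings = {}
    for path in glob.glob(os.path.join(root, "hwmon*", "in*_input")):
        with open(path) as f:
            millivolts = int(f.read().strip())
        # Use the optional in*_label file as the name when the driver has one.
        label_path = path[: -len("_input")] + "_label"
        if os.path.exists(label_path):
            with open(label_path) as f:
                name = f.read().strip()
        else:
            name = os.path.relpath(path, root)
        readings[name] = millivolts / 1000.0
    return readings


def out_of_tolerance(volts, nominal, tolerance=0.05):
    """True if volts differs from nominal by more than the given fraction."""
    return abs(volts - nominal) > tolerance * nominal


if __name__ == "__main__":
    for name, volts in sorted(read_hwmon_voltages().items()):
        print(name, volts)
```

Logging these readings alongside the SMART data would make it
possible to check afterwards whether failed disks sat on low or noisy
rails.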


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
