[Beowulf] Re: failure trends in a large disk drive population

Mark Hahn hahn at mcmaster.ca
Wed Feb 21 18:44:26 PST 2007

> weakly correlated with failure.  However, of all the disks that failed, less 
> than half (around 45%) had ANY of the "strong" signals and another 25% had 
> some of the "weak" signals.  This means that over a third of disks that 
> failed gave no appreciable warning.  Therefore even combining the variables 
> would give no better than a 70% chance of predicting failure.

well, a factorial analysis might still show useful interactions.
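
as a rough sketch of what "interaction" means here: take two binary signals,
tabulate the failure rate in each of the four cells, and see whether having both
signals raises the rate by more than the two separate effects would suggest.
the record layout (signal_a, signal_b, failed) and the function names are
assumptions for illustration, not anything from the study.

```python
from collections import defaultdict


def failure_rates(records):
    """Failure rate for each (signal_a, signal_b) cell of a 2x2 design.

    records: iterable of (signal_a, signal_b, failed) triples, each truthy/falsy.
    This layout is hypothetical; it is not the format used in the study.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [failures, total disks]
    for a, b, failed in records:
        cell = counts[(bool(a), bool(b))]
        cell[0] += int(bool(failed))
        cell[1] += 1
    return {key: fails / total for key, (fails, total) in counts.items()}


def interaction_contrast(records):
    """Effect of signal A when B is present minus its effect when B is absent.

    A value near zero means the two signals act roughly additively; a large
    positive value means the combination is worse than the sum of its parts.
    Raises KeyError if any of the four cells has no disks in it.
    """
    r = failure_rates(records)
    effect_with_b = r[(True, True)] - r[(False, True)]
    effect_without_b = r[(True, False)] - r[(False, False)]
    return effect_with_b - effect_without_b
```

this is only the point estimate; with real data you would also want a
confidence interval or a proper logistic model with an interaction term.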

> number of disks.  For example, among the disks that failed, many had a large 
> number of seek errors; however, over 70% of disks in the fleet -- failed and 
> working -- had a large number of seek errors.

was there any trend across time in the seek errors?
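
one simple way to look for such a trend, assuming you have periodic
(time, seek_error_count) samples per disk: fit a least-squares line and look at
the slope, on the theory that a rising count may say more than a merely large
one. the sample format and function name are hypothetical.

```python
def trend_slope(samples):
    """Least-squares slope of value against time.

    samples: list of (time, value) pairs for one disk, e.g. (day, seek_errors).
    Returns the change in value per unit of time.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    return cov / var_t
```

comparing the distribution of slopes for failed and surviving disks would then
show whether the trend separates the two groups any better than the raw count.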

> So that's our master plan.  Just don't tell anyone. :)

hah.  well, if it were me, the M.P. would involve some sort of proactive
treatment: say, a full-disk read once a day.  SMART self-tests _ought_ 
to be more valuable than that, but otoh, the vendors probably munge the 
measurements pretty badly.
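
a minimal sketch of such a full read, assuming a unix-like system and a
readable device or file path: read from start to end in fixed-size chunks and
note the offsets where the read fails. the function name and chunk size are
arbitrary choices. note that reads served from the page cache never touch the
platters, so a real scrub would need to bypass the cache (this sketch doesn't).

```python
import os


def full_read(path, chunk_size=1 << 20):
    """Read `path` from start to end; return (bytes_read, error_offsets).

    error_offsets lists the starting offset of every chunk whose read raised
    OSError. Uses os.pread, so this is Unix-only.
    """
    bytes_read = 0
    error_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            try:
                data = os.pread(fd, min(chunk_size, size - offset), offset)
            except OSError:
                # Record the bad chunk and skip past it instead of aborting.
                error_offsets.append(offset)
                offset += chunk_size
                continue
            if not data:
                break  # unexpected end of data (e.g. file shrank mid-read)
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, error_offsets
```

run from cron once a day against something like a raw device node, the list of
error offsets is the interesting output; the byte count is just a sanity check.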

regards, mark hahn.

