[Beowulf] Re: failure trends in a large disk drive population


Mark Hahn hahn at mcmaster.ca
Wed Feb 21 18:44:26 PST 2007


> weakly correlated with failure.  However, of all the disks that failed, less 
> than half (around 45%) had ANY of the "strong" signals and another 25% had 
> some of the "weak" signals.  This means that over a third of disks that 
> failed gave no appreciable warning.  Therefore even combining the variables 
> would give no better than a 70% chance of predicting failure.

well, a factorial analysis might still show useful interactions.


> number of disks.  For example, among the disks that failed, many had a large 
> number of seek errors; however, over 70% of disks in the fleet -- failed and 
> working -- had a large number of seek errors.

was there any trend across time in the seek errors?


> So that's our master plan.  Just don't tell anyone. :)

hah.  well, if it were me, the M.P. would involve some sort of proactive
treatment: say, a full-disk read once a day.  SMART self-tests _ought_ 
to be more valuable than that, but otoh, the vendors probably munge the 
measurements pretty badly.
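
a minimal sketch of what such a daily full-disk read could look like, in
Python.  the function name full_read, the 1 MiB chunk size, and the policy of
skipping past a chunk that returns an I/O error are all assumptions made for
illustration; nothing here comes from the study being discussed, and it
assumes a Unix-like system where os.pread is available.

```python
import os


def full_read(path, chunk_size=1 << 20):
    """Sequentially read every byte of `path` (a file or block device).

    Returns (bytes_read, errors), where errors is a list of
    (offset, errno) pairs for chunks that could not be read.
    The name and the skip-on-error policy are illustrative assumptions.
    """
    bytes_read = 0
    errors = []
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end gives the size for regular files and,
        # on Linux, for block devices as well.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset = 0
        while offset < size:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                # Record the unreadable region and move on to the next chunk.
                errors.append((offset, exc.errno))
                offset += want
                continue
            if not data:
                break  # unexpected end of data
            bytes_read += len(data)
            offset += len(data)
    finally:
        os.close(fd)
    return bytes_read, errors
```

run once a day (from cron, say) against a device node, with enough privilege
to open it read-only, and log the returned error offsets.  note that reads may
be served from the page cache rather than the platters, so a real scrub would
likely want direct I/O or a cache drop first; that is left out here.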

regards, mark hahn.


