[Beowulf] Re: failure trends in a large disk drive population

Joe Landman landman at scalableinformatics.com
Fri Feb 16 13:40:59 PST 2007


Hi David,

David Mathog wrote:
> Eugen Leitl <eugen at leitl.org> wrote:
> 
>> http://labs.google.com/papers/disk_failures.pdf
> 
> Interesting.  However google apparently uses:
> 
>   serial and parallel ATA consumer-grade hard disk drives,
>   ranging in speed from 5400 to 7200 rpm
> 
> Not quite clear what they meant by "consumer-grade", but I'm assuming
> it's the cheapest disk in that manufacturer's line.  I don't typically
> buy those kinds of disks, as they have only a 1-year warranty; rather,
> I purchase those with 5-year warranties, even for workstations.

Seagates.

> 
> So I'm not too sure how useful their data is.  I think everyone here

Quite useful, IMO.  I know it wouldn't be PC, but I (and many others)
would like to see a clustering of the data, specifically to see whether
there are any hyperplanes that separate the disks in terms of vendors,
models, interfaces, etc.  CERN had a study up about this which I had
read and linked to, but it now seems to be gone, and I did not download
a copy for myself.
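As a toy illustration of the kind of clustering analysis I mean, a plain
perceptron can recover such a separating hyperplane when one exists.  The
two "vendor" populations below are entirely synthetic, pre-standardized
stand-ins (nothing here comes from the Google or CERN data):

```python
# Sketch only: find a hyperplane separating two hypothetical drive
# populations using synthetic, standardized features (e.g. temperature
# and reallocation counts scaled to roughly [-1, 1]).

def perceptron(points, labels, epochs=100, lr=0.1):
    """Plain perceptron; returns hyperplane weights w and bias b."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):  # labels are +1 / -1
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Two hypothetical drive populations (made-up feature values)
vendor_a = [(-1.0, -0.5), (-1.2, -0.7), (-0.9, -0.6)]
vendor_b = [(1.0, 0.6), (1.1, 0.8), (0.9, 0.7)]
X = vendor_a + vendor_b
y = [-1, -1, -1, 1, 1, 1]

w, b = perceptron(X, y)
score = lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b
print(all((1 if score(x) > 0 else -1) == yi for x, yi in zip(X, y)))  # True
```

With real per-drive records you would of course use a proper method (an
SVM or similar) and far more features, but the question is the same: does
any hyperplane cleanly split the failure populations by vendor or model?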

> would have agreed without the study that a disk reallocating blocks and
> throwing scan errors is on the way out.  Quite surprising about the

"Tic tic tic whirrrrrrr"  scares the heck out of me now :(

> lack of a temperature correlation though.  At the very least I would
> have expected increased temps to lead to faster loss of bearing
> lubricant.  That tends to manifest as a disk that spun for 3 years
> not being able to restart after being off for a half an hour.  
> Presumably you've all seen that. If they have great power and systems
> management at their data centers the systems may not have been
> down long enough for this to be observed.

With enough disks, their sampling should be reasonably good, albeit
biased towards their preferred vendor(s) and model(s).  I would like to
see that data.  CERN compared SCSI, IDE, SATA, and FC.  They found (as I
remember, quoting from a document I can no longer find online) that
there really weren't any significant reliability differences among them.

I would like to see this sort of analysis here, to see whether the real
data (not the estimated MTBFs) shows a signal.  I am guessing that we
could build a pragmatic, time-dependent MTBF based upon the time rate of
change of the AFR.  I think the Google paper was basically saying that
they wanted to do something like this using the SMART data, but found
that it was insufficient by itself to yield a meaningful predictive
model.  That is, in and of itself, quite interesting.  If you could read
back a reasonable set of parameters from a machine and estimate the
likelihood of it going south, that would be quite nice (or annoying) for
admins everywhere.
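To make the AFR-to-MTBF conversion concrete, here is a quick sketch.  The
AFR figures are illustrative placeholders I made up for the example, not
numbers taken from the paper:

```python
# Sketch: a pragmatic, time-dependent "MTBF" estimate from observed AFRs.
# The AFR values below are hypothetical placeholders.

HOURS_PER_YEAR = 8760.0

def mtbf_hours(afr):
    """Convert an annualized failure rate (fraction/year) to MTBF in hours."""
    return HOURS_PER_YEAR / afr

# Hypothetical AFR by drive-age year
afr_by_year = [0.017, 0.08, 0.086, 0.086, 0.07]

# Instantaneous MTBF at each age, plus the year-over-year change in AFR
mtbf_by_year = [mtbf_hours(a) for a in afr_by_year]
afr_deltas = [b - a for a, b in zip(afr_by_year, afr_by_year[1:])]

print([round(m) for m in mtbf_by_year])
print([round(d, 3) for d in afr_deltas])
```

The point being: the field AFR, tracked over drive age, gives you a
running failure-rate estimate that the single vendor-quoted MTBF number
does not.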

It would also be good in terms of tightening down real support costs and
the value of warranties, both default and extended.

> 
> Regards,
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615


