[Beowulf] Are disk MTBF ratings at all useful?
mathog
mathog at caltech.edu
Fri Apr 19 08:47:16 PDT 2013
> On Apr 19, 2013, at 2:50, Fred Youhanaie <fly at anydata.co.uk> wrote:
>> On 19/04/13 00:01, mathog wrote:
>>> High end SATA and SAS disks claim MTBF values that work out to over
>>> 100 years, and yet it is a common observation that certain models
>>> fail at rates entirely inconsistent with those values. For instance,
>>> 75% of all drives of one model dead in < 6 years. (Cited by one
>>> poster in this thread:
>>
>> You may find this paper helpful, some of the data sets used in their
>> studies come from large HPC sites:
>>
>> Bianca Schroeder, Garth A. Gibson
>> Understanding disk failure rates: What does an MTTF of 1,000,000
>> hours mean to you?
>> http://dl.acm.org/citation.cfm?doid=1288783.1288785
>>
>> If you, or your institution, do not have access to the ACM
>> publications, you may be able to find a free copy posted by the
>> authors, ACM does allow that :)
Very good reference. This is the second conclusion from that paper:

  For drives less than five years old, field replacement rates were
  larger than what the datasheet MTTF suggested by a factor of 2–10.
  For five to eight-year-old drives, field replacement rates were a
  factor of 30 higher than what the datasheet MTTF suggested.

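To put numbers on that gap, here is a rough sketch of the arithmetic. It assumes a constant failure rate (exponential lifetimes), which is the usual way to read a datasheet MTTF but is only an assumption, and the function names are made up for illustration. The 1,000,000 hour figure is the one in the paper's title, and the 75% in 6 years figure is the one quoted earlier in this thread.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def annual_failure_rate(mttf_hours):
    """Fraction of drives expected to fail in one year.

    Assumes a constant failure rate (exponential lifetimes). That is
    a modelling assumption, not something the datasheet promises.
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def fraction_failed(mttf_hours, years):
    """Fraction of a population expected to have failed after `years`."""
    return 1.0 - math.exp(-years * HOURS_PER_YEAR / mttf_hours)


def implied_mttf_hours(fraction_dead, years):
    """MTTF that would produce `fraction_dead` failures within `years`."""
    return -years * HOURS_PER_YEAR / math.log(1.0 - fraction_dead)


datasheet_mttf = 1_000_000  # hours
afr = annual_failure_rate(datasheet_mttf)
print(f"datasheet AFR: {afr:.2%}")

# Multipliers from the conclusion quoted above, applied to the AFR.
for factor in (2, 10, 30):
    print(f"field rate at {factor:>2}x datasheet: {factor * afr:.2%}")

print(f"expected dead in 6 years: {fraction_failed(datasheet_mttf, 6):.1%}")
print(f"MTTF implied by 75% dead in 6 years: "
      f"{implied_mttf_hours(0.75, 6):,.0f} hours")
```

Under these assumptions a 1,000,000 hour MTTF predicts roughly 5% of drives dead after six years, while 75% dead in six years corresponds to an MTTF of roughly 38,000 hours, a bit over four years.
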
The paper discusses in some detail one key factor in this discrepancy:
the end user's definition of "failed" usually differs substantially
from the vendor's. For instance, I replace disks when they are either
accumulating swapped out sectors rapidly (write failures) or have
accumulated more than a few pending errors (read failures). The former
indicate that the disk is going south, but no data is lost and the
events are not in themselves disruptive. The latter are disruptive,
since data is potentially lost on each such event and, in any case,
these events must be cleared manually. The vendors most likely would
consider neither of these a failure event, since SMART will still read
PASSED on such drives.
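A minimal sketch of that replacement rule follows. The thresholds and names are hypothetical, picked only to make "rapidly" and "more than a few" concrete. It also assumes that swapped out sectors correspond to SMART attribute 5 (Reallocated_Sector_Ct) and pending errors to attribute 197 (Current_Pending_Sector), as smartctl reports them. The counters are passed in directly, so nothing here queries a real drive.

```python
from typing import NamedTuple, Tuple


class SmartSample(NamedTuple):
    """Raw SMART counters from one poll of a drive."""
    reallocated: int  # attribute 5, Reallocated_Sector_Ct
    pending: int      # attribute 197, Current_Pending_Sector


def should_replace(previous: SmartSample, current: SmartSample,
                   max_new_reallocated: int = 10,
                   max_pending: int = 3) -> Tuple[bool, str]:
    """Apply the two replacement rules. Both thresholds are hypothetical."""
    newly_reallocated = current.reallocated - previous.reallocated
    if newly_reallocated > max_new_reallocated:
        return True, f"{newly_reallocated} sectors reallocated since last poll"
    if current.pending > max_pending:
        return True, f"{current.pending} sectors pending"
    return False, "within thresholds"


# Example: reallocated count jumps between two polls.
last_poll = SmartSample(reallocated=4, pending=0)
this_poll = SmartSample(reallocated=40, pending=1)
print(should_replace(last_poll, this_poll))
```

A drive can trip either rule while its overall SMART health status still reads PASSED, which is the mismatch between the two definitions of "failed".
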
My overall impression is that, when buying drives, the single piece of
manufacturer-provided data that best correlates with the actual
expected life of the drive is the length of the warranty. Even that is
little protection against a bad batch, though.
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech