[Beowulf] Are disk MTBF ratings at all useful?

mathog mathog at caltech.edu
Fri Apr 19 08:47:16 PDT 2013


> On Apr 19, 2013, at 2:50, Fred Youhanaie <fly at anydata.co.uk> wrote:
>> On 19/04/13 00:01, mathog wrote:
>>> High end SATA and SAS disks claim MTBF values that work out to over
>>> 100 years, and yet it is a common observation that certain models
>>> fail at rates entirely inconsistent with those values.  For instance,
>>> 75% of all drives of one model dead in < 6 years.  (Cited by one
>>> poster in this thread:
>>
>> You may find this paper helpful, some of the data sets used in their 
>> studies come from large HPC sites:
>>
>>    Bianca Schroeder, Garth A. Gibson
>>    Understanding disk failure rates: What does an MTTF of 1,000,000
>>    hours mean to you?
>>    http://dl.acm.org/citation.cfm?doid=1288783.1288785
>>
>> If you, or your institution, do not have access to the ACM 
>> publications, you may be able to find a free copy posted by the 
>> authors, ACM does allow that :)

Very good reference.  This is the second conclusion from that paper:

   For drives less than five years old, field replacement rates were
   larger than what the datasheet MTTF suggested by a factor of 2–10.
   For five to eight-year old drives, field replacement rates were a
   factor of 30 higher than what the datasheet MTTF suggested.
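
To put numbers on those factors, here is a rough sketch of the
arithmetic.  It assumes a constant failure rate (an exponential
lifetime model), which is the simplest reading of a datasheet MTTF and
not necessarily how any vendor derives the figure.  The 1,000,000 hour
value is the one in the paper's title; the function names are made up
for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def nominal_afr(mttf_hours):
    """Annualized failure rate implied by a datasheet MTTF.

    Assumes a constant failure rate, so AFR is roughly hours-per-year / MTTF.
    """
    return HOURS_PER_YEAR / mttf_hours


def fraction_failed(mttf_hours, years):
    """Fraction of drives expected to have failed after `years`,
    under the same constant-rate (exponential) assumption."""
    return 1.0 - math.exp(-years * HOURS_PER_YEAR / mttf_hours)


MTTF = 1_000_000  # hours, the figure used in the paper's title

afr = nominal_afr(MTTF)
print(f"MTTF expressed in years: {MTTF / HOURS_PER_YEAR:.0f}")
print(f"nominal AFR from datasheet: {afr:.2%}")

# The paper's reported factors, applied to the nominal rate.
for factor in (2, 10, 30):
    print(f"{factor:>2}x datasheet rate: {factor * afr:.1%} per year")

# What the datasheet alone would predict for a six-year-old population.
print(f"expected dead within 6 years: {fraction_failed(MTTF, 6):.1%}")
```

Under these assumptions the datasheet implies a bit under 1% of drives
failing per year and roughly 5% dead after six years, which is nowhere
near the 75% observed for the model mentioned at the top of the thread.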

The paper discusses in some detail one key factor in this discrepancy:
the end user's definition of "failed" usually differs substantially
from the vendor's.  For instance, I replace disks when they are either
accumulating swapped-out (reallocated) sectors rapidly (write failures)
or have accumulated more than a few pending errors (read failures).
The former indicate that the disk is going south, but no data is lost
and they are not in themselves disruptive.  The latter are disruptive,
since data is potentially lost on each such event and, in any case,
these events must be cleared manually.  The vendors most likely would
consider neither of these a failure event, since SMART will still read
PASSED on such drives.
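
The replacement rule described above could be sketched roughly as
follows.  This is only an illustration: the thresholds (MAX_PENDING,
MAX_REALLOC_PER_DAY) and the SmartSample record are hypothetical, and
the two counters are assumed to be what smartctl typically reports as
Reallocated_Sector_Ct and Current_Pending_Sector on ATA drives.  How
the samples are collected is left out.

```python
from dataclasses import dataclass


@dataclass
class SmartSample:
    """One reading of the two counters for a drive (hypothetical record)."""
    day: float        # time of the reading, in days
    reallocated: int  # swapped-out sectors (write failures)
    pending: int      # pending sectors (read failures)


# Hypothetical thresholds; "rapidly" and "more than a few" are judgment
# calls in the text, so these numbers are assumptions, not recommendations.
MAX_PENDING = 3
MAX_REALLOC_PER_DAY = 1.0


def should_replace(older, newer):
    """Return (decision, reason) for a drive given two samples over time."""
    # Read failures: each one potentially loses data, so a small count suffices.
    if newer.pending > MAX_PENDING:
        return True, "more than a few pending sectors"

    # Write failures: no data lost, so only the growth rate matters.
    elapsed = newer.day - older.day
    if elapsed > 0:
        rate = (newer.reallocated - older.reallocated) / elapsed
        if rate > MAX_REALLOC_PER_DAY:
            return True, "reallocated sectors accumulating rapidly"

    return False, "within thresholds"


# Example: 30 new reallocated sectors in a week trips the rate check.
print(should_replace(SmartSample(0, 10, 0), SmartSample(7, 40, 0)))
```

Note that neither check looks at the overall SMART health status, which
is the point: a drive can trip both rules while still reporting PASSED.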

My overall impression is that, when buying drives, the single piece of
manufacturer-provided data that best correlates with the actual
expected life of the drive is the length of the warranty.  Even that is
little protection against a bad batch, though.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


