[Beowulf] Are disk MTBF ratings at all useful?

Fred Youhanaie fly at anydata.co.uk
Fri Apr 19 02:50:42 PDT 2013

On 19/04/13 00:01, mathog wrote:
> High end SATA and SAS disks claim MTBF values that work out to over 100
> years, and yet it is a common observation that certain models fail at
> rates entirely inconsistent with those values.  For instance, 75% of all
> drives of one model dead in < 6 years.  (Cited by one poster in this
> thread:
> https://groups.google.com/forum/#!topic/comp.unix.solaris/zQjoyc8T01Y
> ).  Additionally, manufacturer warranties at best only go to 5 years,
> which suggests the manufacturers don't have a whole lot of faith in
> their MTBF values.
> Some of you have huge amounts of storage, how many disk models lasted
> as long as their MTBF suggests they should?  (Personally we have only
> one set of disks that are still consistent with the claimed MTBF, a set
> of 6 Fibre Channel disks that came with a Sun server and are now 10
> years old - with no failures.)

You may find this paper helpful; some of the data sets used in it come from large HPC sites:

	Bianca Schroeder, Garth A. Gibson
	Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

If you or your institution do not have access to the ACM publications, you may be able to find a free copy posted by the authors; ACM does allow that :)
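
For a rough sense of scale, here is a small sketch of what an MTTF of 1,000,000 hours would imply if drives failed at a constant rate. The constant-rate (exponential) model is my assumption for illustration, and the function names are made up for this example; field data need not follow that model at all.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def mttf_years(mttf_hours):
    """Convert an MTTF given in hours to years."""
    return mttf_hours / HOURS_PER_YEAR


def fraction_failed(mttf_hours, years):
    """Expected fraction of a population failed after `years` of
    continuous operation, ASSUMING a constant failure rate
    (exponential lifetimes). This is an illustrative model only."""
    return 1.0 - math.exp(-years * HOURS_PER_YEAR / mttf_hours)


def annualized_failure_rate(mttf_hours):
    """Expected fraction failing within one year under the same assumption."""
    return fraction_failed(mttf_hours, 1.0)


mttf = 1_000_000  # hours, the figure from the paper's title
print(f"MTTF in years:            {mttf_years(mttf):.1f}")
print(f"Annualized failure rate:  {annualized_failure_rate(mttf):.2%}")
print(f"Expected failed by 6 yrs: {fraction_failed(mttf, 6):.2%}")
```

Under that assumption the MTTF works out to about 114 years, an annual failure rate just under 0.9%, and roughly 5% of drives failed after 6 years, which is a long way from the 75% quoted above.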

> How do they come up with the MTBF values for disks anyway?  Clearly it
> is not based on watching a large
> sample of disks for countless years!

I can't remember whether I read this in the above paper or elsewhere, but users in the field tend to replace disks at the first signs of failure, e.g. SCSI warnings, while manufacturers' tests may run to total failure. That difference leads to longer MTTF/MTBF claims from the manufacturers.

