[Beowulf] Re: failure trends in a large disk drive population

David Mathog mathog at caltech.edu
Fri Feb 16 14:05:40 PST 2007


Justin Moore wrote:
> 
> > http://labs.google.com/papers/disk_failures.pdf
> 
> Despite my Duke e-mail address, I've been at Google since July.  While 
> I'm not a co-author, I'm part of the group that did this study and can 
> answer (some) questions people may have about the paper.
> 

Dangling meat in front of the bears, eh?  Well...

Is there any info on failure rates versus the type of main
bearing in the drive?

Failure rate versus any other implementation technology?  

Failure rate vs. drive speed (RPM)?

Or to put it another way, is there anything to indicate which
component designs most often result in the eventual SMART
events (reallocation, scan errors) and then, ultimately, drive
failure? 
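(For anyone who wants to watch the same signals on their own
cluster, here's a minimal sketch that dumps the SMART attributes
most often cited as failure precursors.  It assumes smartmontools
is installed and usually needs root; the device path and my pick
of attribute IDs are guesses at the interesting ones, not anything
taken from the paper.)

    #!/usr/bin/env python
    # Minimal sketch: print the SMART attributes commonly treated
    # as failure precursors (reallocations, pending/offline
    # sectors).  Assumes smartmontools; adjust the device path.
    import subprocess

    WATCH = {"5":   "Reallocated_Sector_Ct",
             "196": "Reallocated_Event_Count",
             "197": "Current_Pending_Sector",
             "198": "Offline_Uncorrectable"}

    out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows begin with the numeric ID; the raw
        # count is the last column.
        if fields and fields[0] in WATCH:
            print("%-24s raw=%s" % (fields[1], fields[-1]))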

Failure rates versus rack position?  I'd guess no effect here,
since rack position would mostly act through temperature, and
the paper found little temperature effect.

Failure rates by data center?  (Are some of your data centers
harder on drives than others?  If so, why?)  Are there air
pressure and humidity measurements from your data centers? 
Really low air pressure (as at observatory altitude) is a
known killer of disks; it would be interesting to know whether
smaller changes in air pressure also have a measurable effect.
Low humidity cranks up static problems, while high humidity
can result in condensation.  Again, what happens at values in
between?  Are these effects quantifiable?
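(For scale, since the heads fly on an air bearing and so care
about air density: a back-of-the-envelope with the isothermal
barometric formula p/p0 = exp(-M*g*h/(R*T)).  The altitudes
below are illustrative, mine rather than the paper's.)

    from math import exp

    # Isothermal barometric formula: fraction of sea-level
    # pressure remaining at altitude h.
    M, g, R, T = 0.0289644, 9.80665, 8.31446, 288.15
    for h in (0, 1500, 3000, 4200):  # up to Mauna Kea-class sites
        print("%5d m: %.0f%% of sea-level pressure"
              % (h, 100.0 * exp(-M * g * h / (R * T))))

At 4200 m that comes out near 60% of sea-level pressure, well
past the ~3000 m operating ceiling many drive spec sheets quote.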

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


