[Beowulf] Re: Cooling vs HW replacement

Josip Loncaric josip at lanl.gov
Mon Jan 24 11:09:06 PST 2005


Jim Lux wrote:
> At 08:58 AM 1/24/2005, Josip Loncaric wrote:
> 
>> However, infant mortality can be a *serious* problem.  Once you 
>> install a bad batch of drives and 40% of them start to go bad within 
>> months, you've got an expensive problem to fix (in terms of the 
>> manpower required), regardless of what the warranty says.
> 
> 
> The Seagate documentation actually had some charts in there with 
> expected failure rates, by month, for the first few months.
> 
> [...]
> 
>> Until a better solution is found, we can only make educated guesses -- 
>> and share anecdotal stories about bad batches to avoid...
>>
> 
> Or, spend some time with the full reliability data and make a 
> "calculated" guess.

...but a "calculated" guess would not prevent one from getting hurt by a 
bad batch of drives.  I doubt that Seagate expected 40% infant 
mortality, yet this is precisely what I experienced with 
first-generation 7200rpm Seagate drives in my first cluster.
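
To put rough numbers on the point above, here is a minimal Python sketch of
how far a 40% failure fraction sits from what an advertised rate would
predict.  The cluster size (64 drives) and the advertised early failure
probability (2%) are assumptions chosen purely for illustration; neither
comes from the Seagate charts or from the cluster described above.  It
treats drive failures as independent, which is exactly the assumption a bad
batch violates.

```python
from math import comb


def prob_at_least(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more failures among n drives."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical inputs, for illustration only.
N_DRIVES = 64          # assumed cluster size
EXPECTED_RATE = 0.02   # assumed advertised failure probability for the first months
OBSERVED_RATE = 0.40   # failure fraction reported for the bad batch

expected_failures = N_DRIVES * EXPECTED_RATE
observed_failures = round(N_DRIVES * OBSERVED_RATE)

# Probability of seeing at least this many failures if the advertised rate held
# and drives failed independently of one another.
p_tail = prob_at_least(N_DRIVES, observed_failures, EXPECTED_RATE)

print(f"expected failures under advertised rate: {expected_failures:.1f}")
print(f"failures at a 40% rate:                  {observed_failures}")
print(f"P(at least that many | advertised rate): {p_tail:.3g}")
```

Under these assumed inputs the tail probability is vanishingly small, which
is the point: a calculation built on the advertised rate cannot anticipate a
correlated failure such as a design flaw shared by a whole batch.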

Any new design can have unexpected flaws, whatever reliability figures the 
manufacturer advertises.  This is why actual reliability experience is so 
important -- and building community experience takes time (e.g. 6-12 
months).  The good news is that by then formerly new products are 
considered mature and are priced more competitively.

So, we're back to the well-established rule: staying a step behind the 
bleeding edge lets one avoid design flaws in brand-new products and 
have more confidence in guesses calculated from the manufacturer's 
reliability expectations.

Sincerely,
Josip






