[Beowulf] Re: Cooling vs HW replacement
josip at lanl.gov
Mon Jan 24 11:09:06 PST 2005
Jim Lux wrote:
> At 08:58 AM 1/24/2005, Josip Loncaric wrote:
>> However, infant mortality can be a *serious* problem. Once you
>> install a bad batch of drives and 40% of them start to go bad within
>> months, you've got an expensive problem to fix (in terms of the
>> manpower required), regardless of what the warranty says.
> The Seagate documentation actually had some charts in there with
> expected failure rates, by month, for the first few months.
>> Until a better solution is found, we can only make educated guesses --
>> and share anecdotal stories about bad batches to avoid...
> Or, spend some time with the full reliability data and make a
> "calculated" guess.
...but a "calculated" guess would not prevent one from getting hurt by a
bad batch of drives. I doubt that Seagate expected 40% infant
mortality, yet this is precisely what I experienced with
first-generation 7200rpm Seagate drives in my first cluster.
Any new design could have unexpected flaws, regardless of what the
manufacturer's advertised reliability expectations are. This is why
actual reliability experience is so important -- and building community
experience takes time (e.g. 6-12 months). The good news is that by then
formerly new products are considered mature and are priced more reasonably.
So, we're back to the well-established rule: Staying a step behind the
bleeding edge allows one to avoid design flaws in brand new products,
and have more confidence in guesses calculated on the basis of
manufacturer's reliability expectations.
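For what it's worth, here is a rough sketch of what such a "calculated" guess might look like, and why it says nothing about a bad batch. The monthly failure rates, the drive count, and the function names are all made up for illustration; they are not taken from any Seagate document. The sketch also assumes drives fail independently, which is exactly the assumption a bad batch violates.

```python
from math import comb

def cumulative_failure_fraction(monthly_rates):
    """Fraction of drives expected to have failed by the end of the period,
    treating each monthly rate as a conditional failure probability."""
    surviving = 1.0
    for rate in monthly_rates:
        surviving *= 1.0 - rate
    return 1.0 - surviving

def prob_at_least(n, k, p):
    """Binomial tail: probability that at least k of n independent drives fail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical early-life failure rates per month (assumed, not vendor data).
monthly_rates = [0.005, 0.003, 0.002, 0.001, 0.001, 0.001]
n_drives = 100  # hypothetical cluster size

p = cumulative_failure_fraction(monthly_rates)
expected_failures = n_drives * p
tail = prob_at_least(n_drives, 40, p)  # chance of a 40% loss under this model

print(f"six-month failure fraction: {p:.4f}")
print(f"expected failures out of {n_drives}: {expected_failures:.2f}")
print(f"P(at least 40 failures) under the independent model: {tail:.3e}")
```

Under numbers like these the model predicts a drive or two lost in six months and treats a 40% loss as essentially impossible, so when it does happen the cause is a design or batch flaw that the published rates never accounted for.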