[Beowulf] Re: Cooling vs HW replacement

Josip Loncaric josip at lanl.gov
Mon Jan 24 08:58:47 PST 2005

Jim Lux wrote:
> Actually, I'd trust the MTBF and other reliability data more than the
> warranty, and here's why:

I agree -- but I wish I had two more numbers: the percentage lost to 
infant mortality, and possibly the overall life expectancy.  These would 
describe the "bathtub" failure rate graph in a way that I can apply in 
practice, while MTBF alone is only a partial description.
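
To make the "bathtub" shape concrete, here is a minimal sketch that 
models the failure rate as the sum of a falling infant-mortality term, 
a constant midlife term, and a rising wear-out term.  The Weibull form 
and every parameter value are illustrative assumptions, not 
manufacturer data.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/s) * (t/s)**(k-1), for t > 0.

    shape < 1 gives a falling rate (infant mortality),
    shape > 1 gives a rising rate (wear-out).
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t_years,
                   infant=(0.5, 20.0),    # (shape, scale): assumed values
                   midlife_rate=0.01,     # failures per drive-year: assumed
                   wearout=(5.0, 7.0)):   # (shape, scale): assumed values
    """Failures per drive-year at age t_years under the assumed model."""
    return (weibull_hazard(t_years, *infant)
            + midlife_rate
            + weibull_hazard(t_years, *wearout))


if __name__ == "__main__":
    for age in (0.1, 2.0, 6.0):
        print(f"age {age:>4} yr: {bathtub_hazard(age):.3f} failures/drive-year")
```

With these made-up parameters the rate is high for very young drives, 
lowest around year 2, and climbing again by year 6.  The infant and 
wear-out terms are exactly what a midlife MTBF figure leaves out.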

Life expectancy for today's drives is probably longer than the useful 
life of a computer cluster (3-4 years, but see below).  Therefore, 
midlife MTBF numbers should be a good guide to how many disk 
replacements the cluster may need annually.
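
As a rough sketch of that estimate: if the midlife failure rate is 
constant, the expected number of failures per year is about the number 
of drives times powered-on hours per year, divided by MTBF.  The drive 
count and MTBF below are hypothetical, chosen only to show the 
arithmetic.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming drives run 24/7


def expected_annual_replacements(n_drives, mtbf_hours,
                                 powered_hours_per_year=HOURS_PER_YEAR):
    """Expected midlife failures per year, assuming a constant failure rate.

    Ignores infant mortality and wear-out, so it applies only to the flat
    part of the bathtub curve.
    """
    return n_drives * powered_hours_per_year / mtbf_hours


if __name__ == "__main__":
    # Hypothetical cluster: 1000 drives rated at 500,000 hours MTBF.
    print(expected_annual_replacements(1000, 500_000))  # about 17.5 per year
```

Note that the result scales linearly with powered-on hours, so an MTBF 
quoted for a lighter duty cycle will understate replacements for drives 
run around the clock.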

However, infant mortality can be a *serious* problem.  Once you install 
a bad batch of drives and 40% of them start to go bad within months, 
you've got an expensive problem to fix (in terms of the manpower 
required), regardless of what the warranty says.
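
For a sense of scale, the sketch below compares a bad batch with what 
midlife MTBF alone would predict over the same period.  The 40% figure 
comes from the example above; the drive count, MTBF, and time window 
are hypothetical, and the midlife estimate assumes a constant failure 
rate with drives running 24/7.

```python
HOURS_PER_YEAR = 24 * 365  # assumes drives are powered 24/7


def bad_batch_vs_midlife(n_drives, bad_fraction, mtbf_hours, months):
    """Return (bad-batch failures, MTBF-predicted failures) over `months`.

    bad_fraction is the share of drives that fail within the window;
    the midlife figure assumes a constant failure rate of 1/mtbf_hours.
    """
    hours = months / 12 * HOURS_PER_YEAR
    midlife_failures = n_drives * hours / mtbf_hours
    bad_batch_failures = n_drives * bad_fraction
    return bad_batch_failures, midlife_failures


if __name__ == "__main__":
    # Hypothetical: 256 drives, 40% bad, 500,000-hour MTBF, 6 months.
    bad, midlife = bad_batch_vs_midlife(256, 0.40, 500_000, 6)
    print(f"bad batch: {bad:.0f} failures, MTBF estimate: {midlife:.1f}")
```

Under these assumptions the bad batch produces roughly a hundred 
failures where the MTBF figure would suggest two or three, which is why 
the replacement labor dominates whatever the warranty covers.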

Manufacturers are starting to address this concern, but in ways that are 
very difficult to compare.  For example, Maxtor advertises "annualized 
return rate <1%" which presumably relates to the number of drives 
returned for warranty service, but comparing Maxtor's numbers to anyone 
else's is mere guesswork.

Even if manufacturers were to truthfully report their overall warranty 
return experience, this would not prevent them from releasing a bad 
batch of drives every now and then.  Only those manufacturers that 
routinely fail to meet the industry's typical reliability get 
reputations bad enough to erode their financial position.  So I suspect 
that average warranty return percentages (for surviving manufacturers) 
would turn out to be virtually identical, and thus not very significant 
for cluster design decisions.

Until a better solution is found, we can only make educated guesses -- 
and share anecdotes about bad batches to avoid...


P.S.  Drives are designed for particular markets: expensive server 
drives (->SCSI) are designed to be worked hard 24/7 and rarely spun 
down; cheap desktop drives (->ATA) are designed for light workloads 
10-12 hr/day and more start/stop cycles.  Their respective MTBF figures 
assume these different workloads.  Moreover, target component lifespan 
for cheap drives is 5 years minimum, so this should describe their life 
expectancy -- assuming that a particular batch does not have a design 
defect creating high infant mortality.

If a cluster is good for 3-4 years and its drives for 5, there will be 
some rise in the number of drive replacements needed towards the end, 
but probably still within reason.  This is as it should be: it makes no 
economic sense to overdesign components which will be replaced after 3-4 
years anyway.  Mature consumer products usually reach this balance of 
component reliabilities.  We all know what happens with cars: they work 
for years with modest maintenance, but then everything seems to go wrong 
at once, and it's time to get a new one.
