[Beowulf] Re: Cooling vs HW replacement
Josip Loncaric josip at lanl.gov
Mon Jan 24 08:58:47 PST 2005
Jim Lux wrote:
>
> Actually, I'd trust the MTBF and other reliability data more than the
> warranty, and here's why:

I agree -- but I wish I had two more numbers: percentage lost to infant mortality, and possibly the overall life expectancy. This would describe the "bathtub" failure rate graph in a way that I can apply in practice, while MTBF alone is only a partial description.

Life expectancy for today's drives is probably longer than the useful life of a computer cluster (3-4 years, but see below). Therefore, midlife MTBF numbers should be a good guide to how many disk replacements the cluster may need annually.

However, infant mortality can be a *serious* problem. Once you install a bad batch of drives and 40% of them start to go bad within months, you've got an expensive problem to fix (in terms of the manpower required), regardless of what the warranty says.

Manufacturers are starting to address this concern, but in ways that are very difficult to compare. For example, Maxtor advertises "annualized return rate <1%" which presumably relates to the number of drives returned for warranty service, but comparing Maxtor's numbers to anyone else's is mere guesswork.

Even if manufacturers were to truthfully report their overall warranty return experience, this would not prevent them from releasing a bad batch of drives every now and then. Only those manufacturers that routinely fail to meet the industry's typical reliability get reputations bad enough to erode their financial position -- so I suspect that average warranty return percentages (for surviving manufacturers) would turn out to be virtually identical -- and thus not very significant for cluster design decisions.

Until a better solution is found, we can only make educated guesses -- and share anecdotal stories about bad batches to avoid...

Sincerely,
Josip

P.S.
Drives are designed for particular markets: expensive server drives (->SCSI) are designed to be worked hard 24/7 and rarely spun down; cheap desktop drives (->ATA) are designed for light workloads 10-12 hr/day and more start/stop cycles. Their respective MTBF figures assume these different workloads.

Moreover, target component lifespan for cheap drives is 5 years minimum, so this should describe their life expectancy -- assuming that a particular batch does not have a design defect creating high infant mortality. If a cluster is good for 3-4 years and its drives for 5, there will be some rise in the number of drive replacements needed towards the end, but probably still within reason.

This is as it should be: it makes no economic sense to overdesign components which will be replaced after 3-4 years anyway. Mature consumer products usually reach this balance of component reliabilities. We all know what happens with cars: they work for years with modest maintenance, but then all seems to go wrong at once, and it's time to get a new one.
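
To make the arithmetic behind "midlife MTBF as a guide to annual replacements" concrete, here is a minimal Python sketch. It assumes a constant midlife failure rate (the flat bottom of the bathtub curve) with failed drives replaced as they die, and it treats infant mortality as a simple one-time fraction lost in the first year. The function names, the 256-drive cluster size, and the 500,000-hour MTBF are hypothetical values chosen for illustration, not figures from any manufacturer; the 40% figure is the bad-batch scenario mentioned above.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def expected_annual_replacements(n_drives, mtbf_hours,
                                 powered_hours_per_year=HOURS_PER_YEAR):
    """Expected drive replacements per year in the flat midlife region.

    Assumes a constant failure rate of 1/MTBF per powered-on hour and
    that every failed drive is replaced, so the expected count is simply
    (total drive-hours per year) / MTBF.
    """
    return n_drives * powered_hours_per_year / mtbf_hours


def first_year_replacements(n_drives, mtbf_hours, infant_mortality_fraction):
    """Rough first-year estimate: infant-mortality losses plus midlife failures.

    infant_mortality_fraction is the share of the batch assumed to die
    early (a number manufacturers do not normally publish).
    """
    early_losses = n_drives * infant_mortality_fraction
    return early_losses + expected_annual_replacements(n_drives, mtbf_hours)


if __name__ == "__main__":
    # Hypothetical cluster: 256 drives running 24/7, quoted MTBF 500,000 hours.
    midlife = expected_annual_replacements(256, 500_000)
    bad_batch = first_year_replacements(256, 500_000, 0.40)
    print(f"midlife replacements per year: {midlife:.1f}")
    print(f"first year with a 40% bad batch: {bad_batch:.1f}")
```

Under these assumptions the midlife estimate is roughly 4.5 replacements per year, while a 40% bad batch adds more than a hundred in the first year, which is why the infant mortality number matters more in practice than the MTBF figure. The sketch ignores wear-out near end of life and any early failures among the replacement drives themselves.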
