[Beowulf] Cooling vs HW replacement

George Georgalis george at galis.org
Mon Jan 17 08:01:45 PST 2005


On Mon, Jan 17, 2005 at 10:56:20AM +0100, Ariel Sabiguero wrote:
>
>>If you have 3% failure at 65F at 3 years, and 15% failure
>>at 80F at 3 years,
>>
>What is the source for that figures?
>Of course that if it is the case even 80F is too much.
>

I thought it was clear: I just made up those numbers to illustrate the
hypothetical situation of different temperatures. 10% failure at 3
years means roughly a 0.01% chance of failure every day (10% / 3 / 365),
assuming failures are spread evenly. Failures are not necessarily skewed
toward the end of the period; they could be evenly distributed, or
skewed toward the beginning of the period (which I think is most often
the case: some hardware just lasts, while other units fail at 6
months).
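
A rough sketch of that arithmetic in Python, for anyone who wants to
plug in their own numbers. The function names are made up for
illustration, and the second function (a constant daily failure chance
that compounds over the period) is an added assumption for comparison,
not something taken from measured data:

```python
def daily_failure_linear(cumulative, years):
    """Spread the cumulative failure fraction evenly over every day.

    This is the simple division used above: 10% / 3 / 365.
    """
    return cumulative / (years * 365)


def daily_failure_constant_hazard(cumulative, years):
    """Daily failure chance p such that 1 - (1 - p)**days == cumulative.

    Assumes (hypothetically) the same independent chance of failure
    on each day of the period.
    """
    days = years * 365
    return 1 - (1 - cumulative) ** (1 / days)


if __name__ == "__main__":
    # Made-up illustrative figures: 3%, 10% and 15% failure at 3 years.
    for cumulative in (0.03, 0.10, 0.15):
        linear = daily_failure_linear(cumulative, 3)
        hazard = daily_failure_constant_hazard(cumulative, 3)
        print(f"{cumulative:.0%} at 3 years: "
              f"linear {linear:.6%}/day, constant hazard {hazard:.6%}/day")
```

Both estimates land close to 0.01% per day for the 10% case; neither
models failures that cluster early in the period.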

My guess is that at higher temperatures, failures will be evenly
distributed across the time period, causing continual maintenance
issues, which are more easily addressed for disk failures than for
mainboard and/or CPU failures.

Also, I should clarify: I've not set up a site like this; by
"experience" I really meant exposure. I do know that the hot room and
cold room setup makes a difference, though.

It may well be advantageous to use slow CPUs (i.e. 1.2 GHz, and possibly
under-clocked) for RAID systems in a hot room, to help preserve them.
Power supplies vary wildly in quality and efficiency.  The point about
them being the limiting factor in high temperatures is well taken.

The new Mac mini advertises an operating temperature of 50° to 95° F
(10° to 35° C) http://www.apple.com/macmini/specs.html (that may not be
for continuous operation), and its design would make it easy to gang
several units on one specially designed power supply (for efficiency).
I'm not recommending them, and I don't know anybody who has touched one,
but I would look into them; they do run Linux, I think.

// George


-- 
George Georgalis, systems architect, administrator Linux BSD IXOYE
http://galis.org/george/ cell:646-331-2027 mailto:george at galis.org


