[Beowulf] Cooling vs HW replacement

Robert G. Brown rgb at phy.duke.edu
Mon Jan 17 10:44:15 PST 2005


On Sun, 9 Jan 2005, Ariel Sabiguero wrote:

> Hello all.
> The following question considers only costs, not the uptime or 
> reliability of the solution.
> I need to balance the cost of replacing hardware after failures against 
> the cost of air conditioning.
> The question arises because most current hardware comes with 3 or more 
> years of warranty. Over that period, Moore's law will have doubled 
> hardware performance twice... so is it worth spending money cooling down 
> a cluster, or better to just rebuild it after it "burns out" (by which 
> point it is at least 4 times slower than state-of-the-art)?
> Is it worth cooling the room to a Class A computer-room standard, or 
> saving the money for a hardware upgrade after three years? In warm 
> countries, keeping the air inside a (PC-heated) room at 18ºC when the 
> outside temperature averages 30ºC makes for pretty expensive electricity 
> bills. It is cheaper to "circulate" 30ºC air and accept 40-50ºC inside 
> the chassis.

If you circulate 30C air and have 50+C air inside the chassis, the CPU
and memory chips themselves will be at least 10, and more likely 30 to
50, C hotter than that.  This will significantly reduce the lifetime of
the components.  There is a rule of thumb that every 10 F (about 5.5 C)
of extra ambient air temperature reduces expected lifetime by roughly a
year.  You're talking about running some 3x10 F hotter than optimal for
hardware you hope will last 4+ years.  That could easily reduce the MTBF
of your nodes to 1-2 years.
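
Here is a minimal back-of-the-envelope sketch (Python) of what that rule
of thumb implies.  The baseline ambient and baseline lifetime are
illustrative assumptions, not measured values:

BASELINE_AMBIENT_C = 18.0      # assumed machine-room target temperature
BASELINE_LIFETIME_Y = 4.0      # assumed lifetime at the baseline ambient
YEARS_LOST_PER_10F = 1.0       # the rule of thumb quoted above
F_PER_C = 9.0 / 5.0

def expected_lifetime_years(ambient_c: float) -> float:
    """Expected node lifetime under the linear rule of thumb."""
    delta_f = (ambient_c - BASELINE_AMBIENT_C) * F_PER_C
    lifetime = BASELINE_LIFETIME_Y - YEARS_LOST_PER_10F * (delta_f / 10.0)
    return max(lifetime, 0.5)  # don't let the linear rule run to zero

for t in (18, 25, 30, 35):
    print(f"ambient {t:2d} C -> ~{expected_lifetime_years(t):.1f} years")

At a 30C ambient that comes out to a bit under 2 years, which is exactly
the 1-2 year range above.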

However, this "lifetime" is going to be highly irregular.  All chips are
not equal.  Some subsystems, especially memory, will flake out (give you
odd answers, drop bits) if you habitually run them well above the
desirable ambient.  Some will run for four months, start flaking out,
then break at six months.  Some will run for a year and pop.  Some will
make it to two years, and only a relatively small fraction of your
cluster will make it to years 3-4.
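
To get a feel for how irregular that is, here is a rough simulation
(Python) of node lifetimes drawn from a Weibull distribution whose
characteristic life shrinks when the room runs hot.  The node count,
shape, and characteristic-life values are illustrative guesses, not
vendor data:

import random

def failures_per_year(nodes, char_life_y, shape=1.5, horizon_y=4, seed=0):
    """Count node failures in each year of the horizon."""
    rng = random.Random(seed)
    per_year = [0] * horizon_y
    for _ in range(nodes):
        t = rng.weibullvariate(char_life_y, shape)  # (scale, shape)
        if t < horizon_y:
            per_year[int(t)] += 1
    return per_year

print("cool room:", failures_per_year(100, char_life_y=6.0))
print("hot room :", failures_per_year(100, char_life_y=2.0))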

It is therefore not possible to address "only the costs" without
addressing uptime and reliability.  Downtime is expensive.  For some
kinds of problems, a single crash can cost you a week's worth of work
for the entire cluster.  Unreliable hardware is AWESOMELY expensive; I
know this from bitter, personal experience.  In addition to the
associated downtime, there is all sorts of human time spent going into
the cluster every week or two to pull a downed node, work with it
(sometimes for a full day) against spares to identify the blown
component, order and install the replacement, and get the node back up.
Figure a minimum of 4 hours per event, and as much as 2-3 DAYS if
something isn't broken but is just too flaky -- the system crashes
(because the memory is running too hot) but reboots fine once it has
cooled, and you can't identify a "bad chip" because, technically, there
isn't one except when it is under load AND being "cooled" by hot air.

Time costs money -- generally more money than either the hardware OR the
air conditioning.
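
To put a (hypothetical) number on that, here is a sketch (Python) of the
labor cost alone, using the 4-hours-per-event minimum above.  The node
count, failure rates, and hourly rate are assumptions for you to replace
with your own figures:

NODES = 100
ADMIN_RATE_USD_PER_H = 50.0  # assumed loaded cost of sysadmin time
HOURS_PER_EVENT = 4.0        # the stated minimum; flaky nodes cost days

def yearly_labor_cost(failures_per_node_per_year):
    events = NODES * failures_per_node_per_year
    return events * HOURS_PER_EVENT * ADMIN_RATE_USD_PER_H

print("cool room (5% of nodes fail/yr): $", yearly_labor_cost(0.05))
print("hot room (50% of nodes fail/yr): $", yearly_labor_cost(0.50))

And that is before you count the node-hours of computing lost while the
nodes are down.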

Besides, as a running expense the AC still costs only about 1/3 of what
it costs to power the nodes themselves (depending on the COP of your
cooling system; assume a COP of 3-4).  The rest is infrastructure
investment in building a properly cooled facility.  I'd say make the
investment.
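
The COP argument in numbers (a Python sketch; the node count, per-node
power draw, and electricity tariff are assumed figures):

NODE_POWER_KW = 100 * 0.2    # e.g. 100 nodes at ~200 W each (assumed)
TARIFF_USD_PER_KWH = 0.10    # assumed electricity price
HOURS_PER_YEAR = 24 * 365

for cop in (3.0, 4.0):
    node_cost = NODE_POWER_KW * HOURS_PER_YEAR * TARIFF_USD_PER_KWH
    ac_cost = node_cost / cop  # electricity to remove heat = heat load / COP
    print(f"COP {cop:.0f}: nodes ${node_cost:,.0f}/yr, "
          f"AC ${ac_cost:,.0f}/yr ({ac_cost / node_cost:.0%} of node power)")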

BTW, you might well find that hardware vendors will balk at replacing
equipment under an extended service contract if you don't maintain the
recommended ambient air temperature.  So you might end up paying for a
constant stream of hardware out of pocket, in addition to the labor and
the downtime.  I just don't think it is worth it.

  rgb

> 
> Do you have figures or graphs plotting MTBF vs temperature for the main 
> system components (memory, CPU, mainboard, HDD)?
> Links to this information would be highly appreciated!
> I remember old hardware (40MB RLL disks that shipped with several pages 
> of printed manual) documenting the difference in MTBF under different 
> environmental conditions, but nowadays commodity hardware does not list 
> this on the sticker on top of the device...
> 
> Regards
> 
> Ariel
> 
> PS: if the idea looks worth the money, then I would like to study 
> reliability and uptime, but that is not the main concern now.

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




