Dual Athlon MP 1U units

Robert G. Brown rgb at phy.duke.edu
Sun Jan 27 13:08:04 PST 2002


On Sun, 27 Jan 2002, Velocet wrote:

> On Sat, Jan 26, 2002 at 06:20:10PM -0500, Robert G. Brown's all...
> > On Sat, 26 Jan 2002, W Bauske wrote:
> > 
> > That's basically why I worry about 1U duals.  In principle they'll work
> > -- keep the outside air cool, pull as much cold air through the cases as
> > you can possibly arrange, keep the air clean (so the fans don't clog),
> > monitor thermal sensors and kill if they start getting too hot.  You can
> > see, though, that they are a design that taunts Murphy's Law.  Not too
> > robust.  A little thing like an AC blower motor that blows a circuit
> > breaker at 3 am can reduce your $65K rack of hardware to a pile of junk
> > in the thirty minutes it takes you to find out and do something about
> > it, if you don't have a fully automated (and functioning) shutdown setup.
> 
> This sounds like you shouldn't have closed boxes at all - why not much more
> open cases instead, so that in case some big critical fan somewhere does shut
> down, then you aren't risking a meltdown of your entire cluster... if it's
> at least slightly open to the air of the room it's in, hopefully regular
> convection or other air currents would be enough to keep things cool.

It's not completely clear that this is true if your OVERALL goal is to
keep systems optimally cool while running and to survive the most common
failures.

  a) The cases we use have several fans.  If one dies, the other(s) can
perhaps keep it cool ENOUGH until we identify the unit, pull the box,
and replace the fan.
  b) The key thing is the pattern of flow of cool air over the hot parts
of the box -- the CPU heat sinks, the memory sticks, the power supply
itself.  A closed box can actually have a more predictable airflow
pattern than an open one.

However, yes, a big case has more air and metal and hence a higher heat
capacity all by itself.  It has more surface area through which to lose
heat.  It will generally achieve steady state dissipation (for a given
power consumption level) at a lower temperature than a smaller case.
That temperature will often still be too high for safe operation,
although I've had a CPU survive in a big case for quite a long time
with its CPU fan dead (but its case ventilation fan still working).
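
To put rough numbers on that last point (a toy model only -- the thermal
resistance figures below are invented for illustration, not measured):
treat the case as a single effective thermal resistance R between the
electronics and the room air, so the steady-state interior temperature is
roughly T_room + P*R, and a bigger case with more surface area (smaller R)
equilibrates cooler at the same power draw.

# Toy steady-state model: interior temperature ~ T_room + P * R_case,
# where R_case is an effective case-to-room thermal resistance (K/W).
# The R values are made up for illustration; real numbers depend on
# surface area, airflow, and materials.

def steady_state_temp(t_room_c, power_w, r_case_k_per_w):
    """Equilibrium interior temperature for a given power and thermal resistance."""
    return t_room_c + power_w * r_case_k_per_w

if __name__ == "__main__":
    power = 150.0   # W per dual box, roughly what we expect
    for label, r in [("small 1U case, fans dead ", 0.30),
                     ("big tower case, fans dead", 0.15)]:
        print("%-26s ~%.0fC interior at %gW in a 25C room"
              % (label, steady_state_temp(25.0, power, r), power))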

Ditto for the room the systems are in.  In a closet-sized space,
loss of cooling is a fifteen minute disaster.  A few thousand watts
turns it into an oven (over 40C) in no time, and CPU core temperatures
will be considerably hotter than ambient air.  If the same systems are
in a warehouse-sized volume, the amount they heat the space may be
barely discernible above the background temperature with no AC at all.
Bigger rooms, with lower system density, are lovely if you can afford
them.
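
A back-of-the-envelope sketch of why (room sizes and power draw picked
purely for illustration, and counting only the air -- real rooms warm
somewhat more slowly because walls and racks soak up some of the heat):

# Rough time for a sealed room's AIR to warm by 20C when the AC dies.
# Ignores heat absorbed by walls, racks, etc.; sizes and power are
# illustrative assumptions.

RHO_AIR = 1.2     # kg/m^3, density of air
CP_AIR = 1005.0   # J/(kg*K), specific heat of air

def minutes_to_warm(power_w, volume_m3, delta_t_c):
    """Minutes for power_w watts to raise volume_m3 of air by delta_t_c degrees C."""
    joules = RHO_AIR * volume_m3 * CP_AIR * delta_t_c
    return joules / power_w / 60.0

if __name__ == "__main__":
    power = 3000.0  # W: a modest stack of dual boxes
    for label, volume in [("closet, 2m x 3m x 3m     ", 18.0),
                          ("warehouse, 30m x 50m x 6m", 9000.0)]:
        print("%s %6.0f minutes to warm 20C at %gW"
              % (label, minutes_to_warm(power, volume, 20.0), power))

With these (made-up but plausible) numbers the closet is an oven in a
couple of minutes while the warehouse barely notices for most of a day,
which is the argument for big rooms in a nutshell.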

> This makes a case (*ahem*) for a thermal power switch placed inside the rack -
> if it's 50C (or whatever) in the rack, it's time to cut the power - I am sure
> these things exist and shouldn't be too expensive. Anyone using them?

Better: if your motherboards have onboard thermal sensors (the sort
supported by lmsensors), then it is fairly easy to monitor them and have
the system turn itself off when various criteria are exceeded.  lmsensors
often reports fan speeds and line voltages as well as temperatures.
There are kernel modules that dump their output to files in /proc.

The bad thing is that their use by motherboard manufacturers is very
spotty and that the lmsensors folks have the API from hell (sorry, but
there it is).  They tend to provide a different /proc entry for each
possible sensor chip and arrangement, and there are a lot of them.  This
makes it difficult to write a stable, portable app that relies on the
interface.

Still, if you have them and can figure them out, it is easy to put a
cron script on them to do whatever you like.
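
For instance, a minimal sketch of such a script in Python -- the sensor
path, field layout, and 60C threshold are all assumptions made up for
illustration, since (as noted above) the actual /proc file and its format
depend entirely on which sensor chip and lmsensors module you have:

# thermwatch.py -- minimal sketch of a cron-driven thermal watchdog.
# SENSOR_FILE is a hypothetical example path; adjust it (and the
# parsing) to whatever your chip's module actually exports.  Here we
# assume the reading of interest is the last whitespace-separated
# field of a single-line file, in degrees C.

import os
import sys

SENSOR_FILE = "/proc/sys/dev/sensors/chip0/temp1"   # hypothetical path
MAX_TEMP_C = 60.0                                   # pick your own threshold

def read_temp_c(path):
    with open(path) as f:
        return float(f.read().split()[-1])

def main():
    try:
        temp = read_temp_c(SENSOR_FILE)
    except (OSError, ValueError) as err:
        # Can't read or parse the sensor: complain, but don't power off blindly.
        sys.stderr.write("thermwatch: cannot read %s: %s\n" % (SENSOR_FILE, err))
        return
    if temp > MAX_TEMP_C:
        sys.stderr.write("thermwatch: %.1fC exceeds %.1fC, shutting down\n"
                         % (temp, MAX_TEMP_C))
        os.system("/sbin/shutdown -h now")

if __name__ == "__main__":
    main()

Drop something like that into root's crontab to run every minute or two,
and tune the threshold and the action (email, throttle, halt) to taste.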

>  
> > Not that a stack of 2U duals is MUCH better.  It's still hot -- we have
> > 1800 XP's and probably will have more like 150-160W/box.  If we only put
> > 12 per rack, though, we can leave gaps between the cases and get some
> > cooling from the surfaces of the cases and in any event the cases have
> 
> In case the fans in the case fail, you mean...?

In general.  Keeping things cooler isn't just about preventing immediate
failure -- it also extends the expected life of the components.  Cooler
is better.

> > much larger air volumes, more room for air to flow through, and more
> > room for bigger fans.  With luck we'll have SOME time to react (or for
> > our automated sentries to react) if the room AC fails and the power
> > doesn't.
> 
> Why not custom mount a large number of boards in a common space with
> a similar number of fans? Then if 1 or 2 (or half of the) fans fail,
> there aren't 1 or 2 or more boards risking burnout due to 0 cooling - 
> instead, all the boards involved in that enclosure are sharing half
> the cooling - half being better than none, and half being great when you
> put in 3 times the airflow that was actually required. I'm sure people
> have thought of this before, and there's a reason why it's not more
> popular. Just wondering what all your experience is out there.

Sure, but then you're building things yourself, and you have to figure
out the airflow pretty carefully.  If you're doing large clusters or
lots of clusters, it's hard to argue for custom mounts.  Standard racks,
standard xU cases, standard heavy duty shelving, standard tower cases --
these make it easy to assemble and easy to fix.  It sometimes costs a
little more, but sometimes not!

I've had dreams of mounting a stack of vertical motherboards, for
example, in a standard OTC two drawer file cabinet with the bottoms
knocked out of the drawers.  You could mount fans in the very bottom, the
power supply in the top, drill holes through the sides, wire it all up
neatly.  One could probably mount four or five motherboards per
cabinet.  By the time you add up the cost and the time, though, it's
"cheaper" to get a rack and regular cases (presuming your time has
value), and a LOT cheaper to put four units onto a shelf or stack them
up on the floor.  Grander dreams involve mounting motherboards
vertically on heavy duty shelving, but again it's more for hobbyists
than for serious clusters, unless somebody has lots of time and no money
at all.

Even companies that have designed and built specialized cluster cases
not unlike these designs (e.g. Alta computers) haven't exactly exploded
-- they sell at rates that are comparable to standard rackmounts (per
motherboard/case) but aren't as easy to manage in pieces or mix with
other rackmount components like switches.

> In 3 years we'll hopefully have CPUs that burn 10W at 5GHz instead! :)

Yeah.  But (taking the issue seriously) at the same time there will be
10 GHz CPUs that burn 150W at the bleeding edge, and folks on this list
with large-scale, high-density clusters will still be moaning about heat.
Historically, the bleeding edge runs hot, because if it DIDN'T they'd up
the transistor count, add some more FPE pipelines, add some more cache,
until they did.  Heat dissipation is one of the things that defines the
boundary of CPU design.

Unless somebody works out true optical CPUs, of course, or there is a
breakthrough in physics that renders heat irrelevant.

    rgb

> 
> /kc

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





