[Beowulf] 96 cores in silent and small enclosure

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Tue Apr 13 06:40:12 PDT 2010

On 4/12/10 11:18 PM, "Jon Tegner" <tegner at renget.se> wrote:

>> I think a fair amount of study is needed to really understand the thermal
>> management of these devices.  In many ways, doing it for a modern processor
>> is like doing it for a whole PC board with lots of parts.  You've got
>> different functional blocks, all running at different speeds, some enabled,
>> some disabled, so you can't just have a single "keep the case at point X
>> below temp Y".
>> ***************************************************************************
>> Thanks for the information! Let's see if I understand this correctly:
>> * The temperature reported to bios is the Tctl-temperature?

I don't think so.  Tctl is a "design reference" of some sort.  The BIOS
reports the temperature of some sensor at some point on the chip.  The
relationship between that temperature and the limits is defined in that
document (and it's not a fixed relationship, apparently).

>> * This "temperature" is non-physical, but the number is designed to be
>> relevant to the cooling requirements of the CPU. That is, if this number is
>> larger than Tctl Max, the cpu takes corrective actions, e.g. throttling down?
>> * If this number (Tctl) is below Tctl Max the chances are high that the cpu
>> will live a happy life for many years? It would be stupid of AMD to not have
>> designed this number with some margin to account for different cooling
>> situations.


The manufacturer comes up with some strategy for setting temperature limits,
based on what they think will be acceptable life in some acceptable
installation running some typical instruction stream. The "design reference"
installation and instruction stream probably is different for processors
targeted to different markets (e.g. consumer laptop vs consumer set-top-box
vs rack mounted server farm).  The PC business is horribly cost sensitive, and a few
pennies for a different fan or an extra piece of sheet metal to make the
processor 2 degrees cooler makes a difference.  Because the horde of small
PC manufacturers, in general, doesn't have good thermal designers, Intel and
AMD actually provide a "reference thermal design" including fan size/speed,
duct design, etc., just like they provide reference mobo designs for the
electrical aspects.  A big company like HP or Dell has enough volume
to design their own cases, etc.; they also sell desktops and laptops into
big corporate accounts, which are slightly more sensitive than consumers to
issues like product life.

So, in reality, it's smart of AMD to run right to the ragged edge, at least
for consumer oriented parts.  Most consumers will NOT run 100% duty cycle
and will NOT run their computers in 40C air and will NOT be sensitive to a
few months shorter life (at least in the aggregate).  Given that the
warranty term on most computers is no greater than a year, a 2 year design
life might be reasonable.

The design and use model is very very different from an infrastructure,
industrial (or space) application where long life is an important design
concern. In those kinds of applications, you'll see a lot more attention to
derating, conservative temperatures, and actually understanding the failure
mechanisms.  There's a fairly easily ascertained economic cost to having to
deal with a failed network switch or server.  If that switch is handling
millions of dollars of financial transactions, the downtime cost is pretty
high.  The downtime cost for a consumer PC, after the warranty has run out,
is pretty darn low.
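
To put rough numbers on that comparison, here is a minimal sketch of the
arithmetic.  Every figure in it (failure probability, outage length, hourly
cost) is a made-up placeholder chosen only for illustration, not data from
AMD, Intel, or anyone else, and the function name is hypothetical.

```python
def expected_annual_downtime_cost(annual_failure_prob, hours_down, cost_per_hour):
    """Expected yearly cost of failures: P(failure) * outage hours * hourly cost."""
    return annual_failure_prob * hours_down * cost_per_hour


# All values below are hypothetical placeholders, not measured data.
# A switch carrying financial transactions: short outage, very high hourly cost.
switch = expected_annual_downtime_cost(
    annual_failure_prob=0.05, hours_down=4.0, cost_per_hour=250_000.0
)

# An out-of-warranty consumer PC: longer outage, very low hourly cost.
consumer_pc = expected_annual_downtime_cost(
    annual_failure_prob=0.05, hours_down=48.0, cost_per_hour=5.0
)

print(f"financial switch: ${switch:,.0f} expected per year")
print(f"consumer PC:      ${consumer_pc:,.0f} expected per year")
```

With the same assumed failure probability in both cases, the difference
comes entirely from what an hour of downtime costs, which is why extra
thermal margin is worth paying for in one case and not in the other.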

