[Beowulf] 96 cores in silent and small enclosure

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Sun Apr 11 20:58:38 PDT 2010




On 4/11/10 7:59 PM, "Mark Hahn" <hahn at mcmaster.ca> wrote:
> 
>> * How high cpu temperatures are acceptable (our cluster is built on 6 core
>> AMD opterons)?
> 
> well, you can look up the max operating spec for your particular chips.
> for instance, http://products.amd.com/en-us/OpteronCPUResult.aspx
> shows that OS8439YDS6DGN includes chips rated 55-71.  (there must be some
> further package marking to determine which temp spec...)
> 

I couldn't find the datasheet in a few seconds of casual clicking, BUT...

The temperature limit might be related to the clock rate you're running at.
A faster clock rate or higher power dissipation might come with a lower
temperature limit (or might not).

For instance, if the limit is the junction temperature, there's some thermal
resistance between the reference junction and the measurement point, so if
the chip is dissipating more, the delta T between limiting point and
measurement point is greater.
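
As a rough sketch of that thermal resistance argument (not taken from any AMD datasheet): the junction temperature is the measured temperature plus dissipated power times the thermal resistance between the two points. The function names, the 95 C junction limit and the 0.25 C/W resistance below are illustrative assumptions only.

```python
def junction_temp_c(measured_c, power_w, theta_c_per_w):
    """Junction temperature implied by a reading at the measurement point."""
    return measured_c + power_w * theta_c_per_w


def max_measured_temp_c(tj_max_c, power_w, theta_c_per_w):
    """Highest reading that still keeps the junction at or below tj_max_c."""
    return tj_max_c - power_w * theta_c_per_w


if __name__ == "__main__":
    TJ_MAX_C = 95.0  # hypothetical junction limit, not a real spec
    THETA = 0.25     # hypothetical thermal resistance in C per W
    for power_w in (75.0, 115.0):
        limit = max_measured_temp_c(TJ_MAX_C, power_w, THETA)
        print(f"{power_w:.0f} W -> keep the measured temperature below {limit:.2f} C")
```

With the same junction limit, the part dissipating more power gets the lower limit at the measurement point, which is one way a single part number could carry a range of rated temperatures.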

The limits might also have to do with timing constraints.  The timing
margins of most semiconductor circuits change pretty substantially with
temperature, and what works at a given speed at one temperature might not
work hotter or cooler.  (And a lot of times, those limits are determined
empirically: they test a bunch of cases, and that's what gets published in
the data sheet.)

There's also the whole "instruction stream" effect on the thermal
properties.  An instantaneous dissipation change of 10:1 isn't unusual,
especially if you have on-chip cache and pipelining.

> 
> 
>> I know life span is reduced if temperature is high, but due to
>> performance reasons life span of a CPU is pretty short anyway.
> 
> if you operate the chip within spec, you should expect the lifespan
> to be plenty long (basically indefinite, but let's say 10 years...)

Maybe, maybe not.  Chip life generally follows the Arrhenius rule (roughly
halving life for every 10C rise), but it's hard to know what the "rated"
life is, and whether the exponent is the same.  And, of course, you're
probably not running the thing at max junction temp all the time.  When they
test chips for life, they do accelerated aging testing.  They test some
samples (based on the packaging and fab process and experience) to figure
out a scaling law, then run them really hot, to get "effective" rates of
aging that are very high (so you can get years of "life" in a month or so).
But it's really an art, and sort of a crap shoot anyway.  There have been
lots of cases where things didn't follow the rules, and unexpected things
happened.
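
To put rough numbers on the scaling above, here is a minimal sketch of both the "halve the life per 10C" rule of thumb and the Arrhenius acceleration factor it approximates, AF = exp((Ea/k) * (1/T_use - 1/T_stress)) with temperatures in kelvin. The function names and the 0.7 eV activation energy are assumptions for illustration; the real value depends on the failure mechanism and the process.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def rule_of_thumb_life_factor(delta_t_c):
    """Relative life after a temperature rise, halving for every 10 C."""
    return 2.0 ** (-delta_t_c / 10.0)


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """How much faster aging runs at t_stress_c than at t_use_c."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use_k - 1.0 / t_stress_k)
    return math.exp(exponent)


if __name__ == "__main__":
    EA_EV = 0.7  # assumed activation energy, purely illustrative
    print(f"life factor for +20 C: {rule_of_thumb_life_factor(20.0):.2f}")
    af = arrhenius_acceleration(EA_EV, t_use_c=55.0, t_stress_c=125.0)
    print(f"one month at 125 C ~ {af:.0f} months at 55 C")
```

Under these assumed numbers the acceleration factor comes out at several tens, which is how a month of hot testing can stand in for years of life. Change the activation energy a little and the answer moves a lot, which is part of why the extrapolation is an art.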

>> * Would there be a market potential for a system like this? I naturally tend
> 
> the more specialized the product, the smaller the market.  there are lots
> of mainstream workstations which are fairly quiet.  I've even seen some
> small deskside clusters that claimed to be quiet.  personally I don't
> think it makes much sense - I'd rather use an arbitrarily-noisy cluster
> from a quiet and wimpy desktop.
> 

The usual argument for deskside clusters is that they are under your
personal control, and you don't have to justify your use (or non-use) of
them at any given time. They're "personal" as opposed to a "corporate
resource".




