another radical concept...Re: [Beowulf] Cooling vs HW replacement

Tue Jan 18 12:52:50 PST 2005

OK.. we're all agreed that running things hot is bad practice.. BUT, it
seems we're all talking about "office" or "computer room" environments, and
about problems where a failure in a processor or component has high impact.

Say you have an application where you don't need long life (maybe you've got
a field site where things have to work for 3 months, and then it can die),
but the ambient temperature is, say, 50C.  Maybe some sort of remote
monitoring platform system.

You've got those Seagate drives with the spec for 30C, and some small number
will fail every month at that temp (about 0.5% will fail in the three
months).  But, you'll have to go through all kinds of hassle to cool the 50C
down to 30C.

Maybe your choice is between "sealed box at 50C" and "vented box at 30C", in
a dusty dirty environment, where the reliability impact of sucking in dust
is far greater than the increased failure rate due to running hot.

You just run at 50C, accepting a 10 times higher (or maybe, only 4-5 times
higher) failure rate.  Even at 10 times, you're still down at a 5% failure
rate over the three months.  If you've got half a dozen units, and you write
your software/design your system so you can tolerate a single failure without
disaster, you might have a cost effective solution.
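
To put a rough number on that, here is a minimal sketch of the arithmetic.
It assumes each unit fails independently with the same probability (0.5% at
30C, 5% at 50C over the three months, as above), which is my simplification;
real failures in one hot box are probably correlated. The function name is
just for illustration.

```python
from math import comb


def prob_at_most_k_failures(n_units: int, p_fail: float, k: int) -> float:
    """Binomial probability that no more than k of n_units fail.

    Assumes independent, identically likely failures (a simplification).
    """
    return sum(
        comb(n_units, i) * p_fail**i * (1.0 - p_fail) ** (n_units - i)
        for i in range(k + 1)
    )


# Figures from the text: 0.5% failures in three months at 30C,
# roughly 10x that (5%) at 50C, six units, one failure tolerated.
for label, p in (("30C", 0.005), ("50C", 0.05)):
    ok = prob_at_most_k_failures(6, p, 1)
    print(f"{label}: P(at most one failure among 6 units) = {ok:.4f}")
```

Under those assumptions, six units at 50C get through the three months with
at most one failure about 97% of the time.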

Yes, it requires more sophistication in writing the software. Dare I say,
better software design, something that is fault tolerant?

There's also the prospect, not much explored in clusters, but certainly used
in modern laptops, etc. of dynamically changing computation rate according
to the environment. If the temperature goes up, maybe you can slow down the
computations (reducing the heat load per processor) or just turn off some
processors (reducing the total heat load of the cluster).  Maybe you've got
a cyclical temperature environment (that sealed box out in the dusty
desert), and you can just schedule your computation appropriately (compute
at night, rest during the day).
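
A minimal sketch of what such a policy might look like, purely as an
illustration. The thresholds (full speed below 40C, stopped above 55C), the
linear ramp between them, and the function names are all my assumptions, not
anything from a real system.

```python
def allowed_duty_cycle(temp_c: float, full_below_c: float = 40.0,
                       off_above_c: float = 55.0) -> float:
    """Fraction of full compute rate permitted at a given temperature.

    Hypothetical policy: full speed at or below full_below_c, stopped at or
    above off_above_c, linear ramp in between.
    """
    if temp_c <= full_below_c:
        return 1.0
    if temp_c >= off_above_c:
        return 0.0
    return (off_above_c - temp_c) / (off_above_c - full_below_c)


def active_processors(temp_c: float, total: int) -> int:
    """Alternative policy: shed whole processors instead of slowing them all."""
    return int(total * allowed_duty_cycle(temp_c))


# Made-up temperatures for a sealed box in the desert: cool at night, hot at noon.
for hour, temp_c in ((2, 28.0), (8, 38.0), (14, 52.0), (20, 42.0)):
    print(f"{hour:02d}:00  {temp_c:4.1f}C  duty={allowed_duty_cycle(temp_c):.2f}  "
          f"cpus={active_processors(temp_c, 8)}")
```

With a cyclical environment, the same function effectively gives you the
"compute at night, rest during the day" schedule without anyone having to
write the schedule down.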

This kind of "resource limited" scheduling is pretty common practice in
spacecraft, where you've got to trade power, heat, and work to be done and
keep things viable.

There are very well understood ways to do it autonomously in an "optimal"
fashion, although, as far as I know, nobody has been brave enough to let a
spacecraft actually do it on its own.

Say you have a distributed "mesh" of processors (each in a sealed box), but
scattered around, in a varying environment.  You could move computational
work among the nodes according to which ones are best capable at a given
time.  I imagine a plain with some trees, where the shade is moving around,
and, perhaps, rain falls occasionally to cool things off.  You could even
turn on and off nodes in some sort of regular pattern, waiting for them to
cool down in between bursts of work.
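
As a rough sketch of the idea (not any existing scheduler), one could hand
out work in proportion to each node's thermal headroom. The 55C limit, the
node names, and the proportional rule are assumptions made up for the
example.

```python
def allocate_work(node_temps_c: dict[str, float], work_units: int,
                  limit_c: float = 55.0) -> dict[str, int]:
    """Split work_units among nodes in proportion to thermal headroom.

    Headroom is (limit_c - current temperature), floored at zero, so a node
    at or above the limit gets no work until it cools down.
    """
    headroom = {n: max(0.0, limit_c - t) for n, t in node_temps_c.items()}
    total = sum(headroom.values())
    if total == 0:
        # Every node is too hot: defer all work.
        return {n: 0 for n in node_temps_c}
    shares = {n: int(work_units * h / total) for n, h in headroom.items()}
    # Units lost to rounding down go to the nodes with the most headroom.
    leftover = work_units - sum(shares.values())
    for n in sorted(headroom, key=headroom.get, reverse=True)[:leftover]:
        shares[n] += 1
    return shares


# Hypothetical snapshot: one node under a tree, one in the sun, one overheated.
print(allocate_work({"shade": 35.0, "sun": 50.0, "hot": 60.0}, 100))
```

Re-running the allocation as the shade moves (or after it rains) shifts the
work to whichever nodes are currently coolest. This only works as-is if the
job can be cut into independent units.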

People (perhaps, some are even on this list) are developing scheduling and
work allocation algorithms that could do this kind of thing (or, at least,
they SHOULD be).  It's a bit different than the classical batch handler, and
might require some awareness within the core work to be done.  Ideally, the
computational task shouldn't care how many nodes are working how fast, or
which nodes, but not all applications can be that divorced from knowledge of
the computational environment.

Jim Lux
Flight Communications Systems Section
Jet Propulsion Lab

----- Original Message -----
From: "Karen Shaeffer" <shaeffer at neuralscape.com>
To: "Josip Loncaric" <josip at lanl.gov>
Cc: <beowulf at beowulf.org>
Sent: Tuesday, January 18, 2005 10:59 AM
Subject: Re: [Beowulf] Cooling vs HW replacement

> On Tue, Jan 18, 2005 at 09:30:03AM -0700, Josip Loncaric wrote:
> > At my old job, we had the unfortunate experience of AC failing on the
> > hottest days of the year.  Despite providing plenty of circulating fresh
> > 35-40 deg. C air, we lost hardware, mainly disks.  In fact, we'd start
> > losing hard drives (even high quality SCSI drives in our servers) any
> > time the ambient temperature approached 30 deg. C.
>
> Hello,
>
> I would certainly agree with the assertion that disk drive MTBF has
> a strong, nonlinear dependency on operating temperature. While I have
> not run disks at out of spec temperatures, I did work at Seagate for a
> few years, where I learned of this very strong dependence. This thread
> began with the assertion that you do not need to cool disks, but I
> think this is a very ill-advised strategy.
>
> YMMV,
> Karen
> --
>  Karen Shaeffer
>  Neuralscape, Palo Alto, Ca. 94306
>  shaeffer at neuralscape.com  http://www.neuralscape.com