[Beowulf] The Walmart Compute Node?

Fri Nov 9 10:00:14 PST 2007

On Fri, 9 Nov 2007, Larry Stewart wrote:

> The flaw in this argument is that a slower clock design can use the same 
> small transistors and the same current state of the art processes and it will 
> use many fewer transistors to get its work done, thus using very much less 
> power.  Our 1 GF core is 600 milliwatts, for example.
> Even after adding all the non-core stuff - caches, memory controllers, 
> interconnect, main memory, and all overhead, it is still around 3 watts per 
> GF.

I think that you have to take as granted (I did say it explicitly and in
some detail) "in a processor family".  So comparing apples to oranges
doesn't work, comparing apples to apples of roughly the same age and
stepping does work within reason, comparing old apples to brand new
remasked apples doesn't work.  I noted that a naive power scaling
argument applied to, say, the 8088 (or more fairly, the Pentium Pro)
would have modern processors using enough power inside a chassis sized
space to ignite a thermonuclear fusion process, or something like that
(just kidding -- more like as hot as an electrical arc or thereabouts,
probably far away from the fusion threshold:-).  Lessee, 5 MHz to
roughly 5 GHz, a thousandfold, transistor count several orders of
magnitude up (don't really know how many, but the CACHE on a modern CPU
is many powers of two larger than the ENTIRE MEMORY of my first IBM PC,
which in turn was big compared to e.g. the IBM 5100).

So nobody would dispute that a completely different processor,
especially a new design, might have completely different power features
than the one being discussed.  Even comparing AMD's and Intel's binary
compatible processors is dicey.  Different is different.

However, within a processor family, within some reasonable morphs of the
original stepping and chip layout, higher clocks usually consume less
power per cycle than lower clocks both because of the more or less fixed
overhead and because if they didn't, you'd exceed the practical limit of
power dissipation Jim and I both mentioned (Jim in detail).  So as each
processor family ramps up clock, it does so subject to a number of
constraints -- has to fit on a socket in a standard packaging.  Has to
accept a fairly standard mass market cooler.  Can't generate more than X
watts (ballpark 100, although of course people shoot for "as little as
possible") per package.  Can't radiate too much EM energy.  Can't
cross-couple radiated energy (constraints on the internal layout,
sharpness of corners, length of parallel runs).  Probably lots more that
I don't know about.
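
A rough sketch of the fixed-overhead half of that argument, in Python.
This is a toy model, not a measurement: it assumes total power is a
clock-independent term plus a term linear in clock (i.e. constant
voltage), and every number and name in it is hypothetical, picked only
to stay under the ballpark 100 W package limit mentioned above.

```python
def energy_per_cycle(fixed_watts, joules_per_cycle, clock_hz):
    """Energy per clock cycle (joules) for the toy model P = fixed + k * f."""
    total_watts = fixed_watts + joules_per_cycle * clock_hz
    return total_watts / clock_hz


# Hypothetical values, for illustration only.
FIXED_WATTS = 20.0        # overhead that does not depend on clock (assumed)
JOULES_PER_CYCLE = 20e-9  # dynamic energy per cycle at fixed voltage (assumed)

for ghz in (1.0, 2.0, 3.0):
    e = energy_per_cycle(FIXED_WATTS, JOULES_PER_CYCLE, ghz * 1e9)
    print(f"{ghz:.1f} GHz: {e * 1e9:.1f} nJ per cycle")
```

In this model the fixed term is spread over more cycles as the clock
goes up, so energy per cycle falls toward the dynamic cost.  Real parts
also change voltage with clock, which the sketch deliberately ignores.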

>> In ADDITION to this is the fact that the processor has to live in a
>> house of some sort, and the house itself adds per processor overhead.
>> This overhead is significant -- typically a minimum of 10-20 W,
>> sometimes as much as 30-40 (depending on how many disks you have, how
>
> This factor does not scale this way!  With low power processors, you can pack 
> them together, without the endless support chips, you
> can use low power inter-chip signalling, you can use high efficiency power 
> supplies with their economies of scale.  If you look inside
> a PC there are two blocks doing useful work - memory and CPUs, and a whole 
> board full of useless crap.  Look inside a machine designed
> to be a cluster and there should be nothing there but cpus and memory.

On the contrary, it DOES scale this way.  Again, in context, there are
two general threads of discussion on list.  One of them is what we might
call "pro-grade clusters".  As somebody (Andrew?) pointed out, the
Walmart box is a laughable object to offer up as a pro-grade cluster.
It's packaged wrong from the beginning and nobody on list would
seriously consider it for a properly budgeted cluster paid for by OPM.
When I compare it to a UP dual core AMD 64 (instead of one of the 8-way
boxes Doug reviewed in the latest LM online or some other rackmount form
factor or blade or advanced/dedicated packaging) I'm fairly clearly
selecting the alternative, very common thread context:  "Beginner grade"
or "Homemade" or "Small special purpose" clusters.

Truthfully, all of these small clusters are closer to the original COTS
versions of beowulfs.  The original beowulf itself was a small mountain
of tower units on shelving purchased at a hardware store, and the
Walmart unit, had it existed, would have been right at home in it (and
indeed, would have cranked its numerical throughput right through the
roof compared to all those PENTIUMS).  I'm pretty sure it is what Peter
was looking for, as well -- I don't think he has a half-million to
invest in a cluster, I'm sorry to say.  More like a few thousand, and it
may even come out of his own personal pocket or a very small research
budget.  I build my home cluster out of a very similar set of
constrained resources, which is why I am interested and participating in
this discussion.

So sure, one interested in pro-grade clusters can buy a "machine designed
to be a cluster", but it isn't, really, a beowulf anymore because it is
almost by hypothesis not a COTS system any more.  Then you deal with the
vendors/manufacturer of that system and take what they give you,
generally at a huge markup (although there are a few hardy souls who do
tackle board-up design and hand-mounting COTS motherboards in various
ways, they generally either don't have enough to do or are doing it for
fun, because straight up COTS is usually cheaper if you assign nearly
ANY value to your own labor).

So yeah, you do have to live with COTS cases and power supplies, and pay
a certain fixed overhead for idle operation of a power supply regardless
of its load.  You do use a COTS motherboard, which probably does have
USB ports, a serial port or two, a video card, a network interface or
two that burn some juice even when nothing is plugged in and that you
don't really "need".  You may or may not build a system with per-case
disks -- depends on what the cluster is for, how skilled you are, and a
bunch of other stuff.  Sure you save some money with a diskless design
(both on hardware and power), but it is considerably harder for a novice
to ramp up to a functional cluster this way (although it is certainly
getting easier than it was when I ran my first diskless clusters, let me
assure you:-).  One day soon those "useless" USB ports may be the best
way to boot a cluster from a flash drive and have the best of both
worlds, or some clever soul will start selling motherboards with
integrated 16 GB flash memory that boot and run perfectly without a
physical disk -- one can EASILY boot a cluster image in a quarter of
that, and can boot a very nicely furnished desktop image in half of it.
It will probably still be hard to get COTS versions of this motherboard
without USB and this and that, though.

The point is that for MOST people building a more or less "standard COTS
cluster" according to the usual recipe, tower or rackmount as the case
may be, there are advantages in putting lots of processors in a single
case and sharing the power and cost overhead amongst them, if the
parallel scaling of your task, given its bottlenecks, permits.  There is
always SOME overhead, and your remark above is really "yes this is truer
than you think" not "that is wrong".  Neither of which is the case.
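
The overhead-sharing arithmetic can be sketched in a few lines of
Python.  The 30 W chassis overhead sits inside the 10-40 W range quoted
earlier in the thread; the 50 W per core figure and the function name
are assumptions made up for the example, not measurements of any real
box.

```python
def watts_per_core(chassis_overhead_watts, core_watts, cores_per_chassis):
    """Average power per core when one chassis overhead is shared by all cores."""
    if cores_per_chassis < 1:
        raise ValueError("need at least one core per chassis")
    total_watts = chassis_overhead_watts + core_watts * cores_per_chassis
    return total_watts / cores_per_chassis


CHASSIS_OVERHEAD_WATTS = 30.0  # within the 10-40 W range from the thread
CORE_WATTS = 50.0              # hypothetical per-core draw (assumed)

for cores in (1, 2, 4, 8):
    per_core = watts_per_core(CHASSIS_OVERHEAD_WATTS, CORE_WATTS, cores)
    print(f"{cores} cores per case: {per_core:.2f} W per core")
```

The per-core figure drops toward the bare core draw as cores are added,
with diminishing returns, and it says nothing about whether the task
can actually use those cores.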

Depending on your skill and how carefully you shop and the details of
your task you may be able to do better or worse at the process of
putting the "right" number of cores into a single case to share the
overhead, but the rule of thumb that more cores per fixed unit of
overhead is better will still hold until (for your particular task mix
and set of available, realistic, COTS hardware ingredients) it doesn't.
The idea is to understand the rule of thumb, so that you can recognize
the exceptions, so that you can optimize your design as much as possible
for your budget, task mix, etc.

There are obviously tasks, for example, where putting as many cores as
possible into a single chassis on a single motherboard is not the best
thing to do.  If you have a fine-grained parallel task that needs every
bit of a high speed communications channel for just one processor core,
then putting two or four cores in a box that perforce shares that
channel to get to the other umpty motherboards in their own bottlenecked
cases is not going to be productive no matter what the power savings or
nominal aggregate FLOPS/dollar.  (And yes, even this may not be
perfectly true if the overhead of the communications channel itself
can be parallelized when the system has two cores, blah blah blah.)

    rgb


-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977