Monoprocessor or Bi-processor nodes? K7 or Pentium III?

Thu Sep 21 10:12:13 PDT 2000

On Thu, 21 Sep 2000, Carlos J. García Orellana wrote:

> Hello,
> 
> We are going to buy a cluster (around 20-25 nodes), and we have some
> questions.
> 
> * What is better, multiprocessor motherboards or monoprocessor? Take into
> account that we have money to buy 20-25 processors.
> 
> * And, K7 (and then monoprocessor) or Pentium III?

I cannot tell you what the optimum solution for your needs will be, but
I can tell you a bit about how to find it.

If possible, benchmark your primary application(s) on both single and
dual CPU systems.  What you are looking for is the drop off in dual
performance from single performance.  For example, if it takes ten
minutes to run on a single processor of the dual (with the dual
otherwise idle, or better on a single processor system at the same clock
and similarly configured) and eleven minutes to run when you run two
instances of the job at the same time on the dual, you've experienced
around a 10% degradation of performance running your application in
parallel on a dual.  If you ran the job in parallel on two single
processor systems, you could expect to complete in 10% or so less time
than running it in parallel on a dual.
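
The arithmetic in that example can be sketched in a few lines of Python.
The function names are mine, made up for illustration; the 10 and 11
minute figures are the ones from the example above.

```python
def dual_degradation(single_minutes, dual_minutes):
    """Fractional slowdown when two instances share a dual,
    relative to one instance running alone."""
    return (dual_minutes - single_minutes) / single_minutes


def time_saved_on_singles(single_minutes, dual_minutes):
    """Fraction of wall-clock time saved by running the two instances
    on two single-CPU systems instead of on one dual."""
    return (dual_minutes - single_minutes) / dual_minutes


slowdown = dual_degradation(10.0, 11.0)
saved = time_saved_on_singles(10.0, 11.0)
print(f"degradation on the dual:   {slowdown:.1%}")  # 10.0%
print(f"time saved on two singles: {saved:.1%}")     # 9.1%
```

The two percentages differ slightly because one is measured against the
10 minute run and the other against the 11 minute run, which is why
"10% or so" is the honest way to say it.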

Some jobs will complete in pretty much exactly the same time running one
processor or two processors at a time.  Others will not.  The issue is
the extent to which the two jobs compete for single system resources,
which can be bottlenecked when subjected to the full-speed demands of
two independent jobs.  If your CPU-only job is bottlenecked on a dual, it
likely means that your job is memory intensive and both CPUs are trying
to interact with memory at once (so one has to wait a bit).  Don't try
to second-guess the benchmark -- nearly all jobs interact a lot with
memory, but this is VERY dependent on just how often the cache has to
get filled, how many instructions are executed between cache fills, how
fast a memory request is filled, and that sort of thing, all of which
can vary wildly between systems and even between runs of a single job on
a single system at different sizes.
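
As a rough sketch of what "benchmark at different sizes" means, here is
a tiny Python timing harness.  Everything in it is an assumption for
illustration: summing an array is only a stand-in for your real job, the
sizes are arbitrary, and interpreter overhead in Python will hide most
cache effects.  The point is the pattern (same work, several sizes, time
per element), not the numbers.

```python
import time
from array import array


def seconds_per_element(n_elements, repeats=3):
    """Best-of-`repeats` time to sum an array, divided by its length."""
    data = array("d", [1.0]) * n_elements  # contiguous block of doubles
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        sum(data)  # stand-in workload; replace with your own kernel
        best = min(best, time.perf_counter() - start)
    return best / n_elements


# Hypothetical sizes; choose ones that bracket your job's working set.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} elements: {seconds_per_element(n) * 1e9:.1f} ns/element")
```

Run the same sweep with one copy and then with two copies at once on a
dual, and compare the per-element times at each size.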

Next, try to figure out the pattern and time likely to be required for
interprocessor communications in your parallel application.  Again it is
ideal to measure this, if you have a few "prototyping" nodes around to
play with on a network like the one you plan to use in the 'wulf.  If
you are prototyping, you run the job on small clusters of single and
dual nodes.  Again, you are looking for performance drop off caused by
contention on a bottlenecked resource, in this case the network.  If
both CPUs on a dual need to use the one ethernet port to communicate
with two other nodes at the same time, one must wait a bit and overall
performance can drop off.  On the other hand, communication between the
two CPUs on a single node is likely significantly faster (in both
bandwidth and latency) than communication between nodes connected by
ethernet.
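
A crude way to think about the shared-port contention described above is
the usual latency-plus-bandwidth estimate of message time, with the two
CPUs taking turns on the one port in the worst case.  This is a toy
model of my own, not a measurement: the function names, the 100
microsecond latency, and the 100 Mbit/s bandwidth are all assumed
values picked for illustration.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message: latency plus size/bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def shared_port_time(n_bytes, latency_s, bandwidth_bytes_per_s, senders=2):
    """Worst case when `senders` CPUs on one node take turns on a single
    port: the last message finishes after all of them have gone out."""
    return senders * transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)


# Assumed, illustrative parameters (not measured on any real network).
LATENCY_S = 100e-6   # 100 microseconds
BANDWIDTH = 12.5e6   # 100 Mbit/s expressed in bytes per second
MESSAGE = 1_000_000  # 1 MB message

alone = transfer_time(MESSAGE, LATENCY_S, BANDWIDTH)
shared = shared_port_time(MESSAGE, LATENCY_S, BANDWIDTH)
print(f"one sender:  {alone * 1e3:.1f} ms")
print(f"two senders: {shared * 1e3:.1f} ms for the second to finish")
```

Real networks overlap and interleave traffic, so treat this as an upper
bound to compare against what you actually measure on prototype nodes.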

As before, there are significant tradeoffs to consider, but now they are
REALLY a mess to untangle if your problem isn't "simple" (coarse
grained, embarrassingly parallel, or maybe master-slave).  The
communications load can shift tremendously depending on the topology of
the connections, the way the code is written, the algorithms used for
communication and more.  If your problems are "complex" in their
parallel communications, you should study parallel computation in
application to your particular problem before deciding on BOTH a node
AND a network, as there are cases where you should invest (sometimes
far) more in one at the expense of the other in order to get the most
work done in the least time for your money.

Then, when you understand the performance tradeoffs quite well
(including the CPU speed and clockspeed/price tradeoff) you can
intelligently pick the "best" architecture in the sense that you
literally get the most for your money, with a rational basis for what
you do.
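
Once you have measured run times, "the most for your money" can be
reduced to a simple throughput-per-dollar figure.  The sketch below is
only that, a sketch: the function name is mine, and the node prices are
hypothetical placeholders, not quotes.  The 10 and 11 minute run times
are the ones from the benchmark example earlier.

```python
def jobs_per_hour_per_dollar(node_cost, cpus_per_node, job_minutes):
    """Throughput per dollar for one node, assuming one independent job
    per CPU and `job_minutes` measured with all CPUs on the node busy."""
    return cpus_per_node * (60.0 / job_minutes) / node_cost


# Hypothetical prices; substitute real quotes and your own timings.
single = jobs_per_hour_per_dollar(node_cost=1000.0, cpus_per_node=1,
                                  job_minutes=10.0)
dual = jobs_per_hour_per_dollar(node_cost=1600.0, cpus_per_node=2,
                                job_minutes=11.0)

print(f"single node: {single:.5f} jobs/hour/dollar")
print(f"dual node:   {dual:.5f} jobs/hour/dollar")
print("dual wins" if dual > single else "single wins")
```

With these made-up prices the dual comes out ahead despite its 10%
slowdown; a larger slowdown or a smaller price gap reverses the answer,
which is exactly why you measure first.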

That said, some limiting cases.  These are just my opinion, and may not
all be correct as some of them are based on things I've learned on this
list and haven't directly experienced myself.

If your code really is embarrassingly parallel and is not memory (or
network, by definition) bound, the Celeron (really the dual Celeron) has
been a price-performance leader.  In a dual package, you save a bit on
chassis, network, and disk (if a disk is needed) and might get by with a
bit less runtime memory per CPU as well (as the two CPUs share an OS
image).  However, the marginal cost of the cheap PIIIs (at similar
clock) isn't TOO great, and for some applications the larger and more
intelligent cache of the PIII might easily justify the cost
differential.

The microbenchmarks I've done with the Athlon vs PIII suggest to me that
the Athlon is greased lightning when running out of L1 cache but slows
WAY down once it has to go beyond L1, to slightly worse than a PIII at
equivalent clock.  It's enough cheaper that it is probably a small
price-performance winner at the higher clock speeds (where Intel CPUs
are absurdly costly) but I'm not impressed at lower clocks.

The microbenchmarks I've done with Alphas suggest that they are very
rarely going to be price/performance leaders for single-threaded
(embarrassingly parallel, CPU bound) code.  You can get a whole lot of
Celerons for the cost of a single Alpha.  This may be changing -- Compaq
sounds like they are trying to move the Alpha closer to the
price-performance range of the Intel CPUs, which is useful as the Intel
and Athlon CPUs start getting to significantly higher clocks.  Alphas do
have other performance advantages though, addressed next.

If you are running moderately complex parallel code, I don't know what
you should get.  That's where the right answer will probably be a LOT
more cost effective than a wrong answer and there's no easy way to find
the right configuration.

If you are running tightly coupled code with fine grain parallelism,
from what I've learned on the list and at talks you should indeed think
very seriously about Alphas, specifically Alphas interconnected with a
very high speed network, e.g. Myrinet.  High end Alphas have a lot of
technical advantages that keep them from being bottlenecked as early and
as often as the (more mass-market oriented) Intel architectures, and
nonlinear drop-offs associated with bottlenecks are what ultimately kill
your parallel scaling.
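
To see why nonlinear drop-offs kill scaling, here is a deliberately
simple toy model: compute time divides evenly across nodes while
communication time grows linearly with the number of nodes.  Both the
linear communication term and the numbers are assumptions of mine for
illustration; real codes and networks behave differently and have to be
measured.

```python
def speedup(n_nodes, compute_s, comm_per_node_s):
    """Speedup over one node when compute time divides evenly and each
    additional node adds a fixed communication cost (toy assumption)."""
    parallel_time = compute_s / n_nodes + comm_per_node_s * (n_nodes - 1)
    return compute_s / parallel_time


# Hypothetical job: 100 s of compute, 0.5 s of communication per extra node.
for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:>2} nodes: speedup {speedup(n, 100.0, 0.5):.2f}")
```

In this model the speedup rises, flattens, and then falls as nodes are
added, because the communication term eventually dominates.  A faster
network or a less easily bottlenecked node pushes that turnover point
further out.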

If you are in either of the latter two categories (moderate grained or
fine grained, complicated communications patterns) you will either have
to work pretty hard or consider getting consultative help in your
beowulf design to really optimize it.  It might be worth paying the
margin a turnkey beowulf provider would charge just to get the
consultation and problem analysis services they can often provide as
part of their package.  If you MUST do it yourself (as your budget
sounds pretty limited) you should probably invest some of it in a bunch
of books on beowulf design and parallel programming, and learn how your
parallel algorithm and system design parameters will interact to provide
your ultimate parallel performance scaling.

Hope this helps,

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu