[Beowulf] dual-core benefits?

Robert G. Brown rgb at phy.duke.edu
Thu Sep 22 10:52:41 PDT 2005


Tahir Malas writes:

> Hi everyone,
> 
> I would like to ask for advice on processor selection for the cluster that
> we will configure soon. Comparing the sequential performance of our programs
> on an Opteron 246 and a much more expensive machine with an Itanium
> processor, we have decided to use Opteron processors with Tyan mbs. However,
> we are having trouble deciding on the exact processors. Before posing my
> questions, I'd better give some info about our application requirements:
> 
> 1. The scalability of our program is not so good, less than 20 for 32 nodes
> (measured relative to a single-node system). So we don't plan to go beyond
> 16 nodes (which makes 32 processors, since the nodes are dual-processor).
> 
> 2. The memory requirement is huge; we will use 4 GB of memory per node for
> the time being and increase this to 16 GB later. So we need fast CPUs and
> efficient usage of memory.
> 
> 3. Due to budget limitations we will first configure an 8-node system with
> 4 GB RAM per node and extend this to a 16-node system with 16 GB of RAM in
> 6 months.
> 
> We were thinking of AMD 250 processors, but now the benchmarks of dual-core
> CPUs (on AMD's web site) seem encouraging, and the cost of a dual-core AMD
> 275 seems to be less than twice that of an AMD 250. Since the memory cost of
> our system will dominate the other costs, we can afford to move to dual-core
> technology. However, the questions that arise are as follows.

The first thing you have to do is identify WHY the scaling of your code
isn't so good -- 20 for 32 nodes.  It's good that you've run benchmarks,
but you also have to do some deeper probing on the basis of those
benchmarks to optimize your cluster engineering.  You're also wise to do
8 nodes first and then 16, independent of budget.  Flaws in your design
that reveal themselves can then be fixed, perhaps in the second-round
purchase.

> 1. Will it be worth it? And can we gain any advantage over single-core given
> the not-so-good scalability of our parallel programs?

This is one of those questions that cannot be answered from the data
given, and maybe not from data you have in hand, but it can be answered.

First -- what is bottlenecking the parallel process?  Memory access
speed?  Network IPCs?  Local computation?  A combination of the three?
In particular you are interested in what is causing the fall-off from
linear scaling -- as you run the job on more and more nodes, those nodes
are spending more time communicating (for example) per unit of
computation.
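
One concrete way to get at this, assuming your code uses MPI: bracket the
compute kernel and the communication calls with MPI_Wtime() and watch how the
two totals change as you add nodes.  The little program below is a
self-contained toy sketch -- the "work" loop and the Allreduce just stand in
for whatever your application really does, they are not anyone's actual code:

/* Toy bottleneck probe -- compile with mpicc, run at several node counts
   and watch how the compute/communicate ratio changes.  The "work" loop
   and the Allreduce below are stand-ins for your real kernel and IPCs.  */
#include <mpi.h>
#include <stdio.h>

#define N 100000

int main(int argc, char **argv)
{
    int rank, size, step, i;
    double t0, t_comp = 0.0, t_comm = 0.0;
    static double local[N], global[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (step = 0; step < 100; step++) {
        t0 = MPI_Wtime();
        for (i = 0; i < N; i++)               /* "computation" */
            local[i] = local[i] * 1.0000001 + 1.0;
        t_comp += MPI_Wtime() - t0;

        t0 = MPI_Wtime();
        MPI_Allreduce(local, global, N, MPI_DOUBLE,   /* "IPCs" */
                      MPI_SUM, MPI_COMM_WORLD);
        t_comm += MPI_Wtime() - t0;
    }

    printf("rank %d/%d: compute %.3f s, communicate %.3f s\n",
           rank, size, t_comp, t_comm);
    MPI_Finalize();
    return 0;
}

If the communicate column grows as a fraction of the total while you add
nodes, you have your answer about where the scaling is going.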

If you are bottlenecked at the network, adding more processing cores
(and trying to use them) can actually SLOW DOWN your computation --
effectively taking you to 64 nodes and doubling the burden on your
already overloaded node network.

If you are bottlenecked at the network, you should also look hard at
your expenditure pattern.  I'm assuming that you're using gigabit
ethernet, as the cheapest mass-market network with decent bandwidth
available for this range of nodes.  However, there are much faster and
more efficient networks available.  Some of them are expensive enough
that they will "cost you nodes" -- you'll have to get fewer nodes and a
better network -- but they may restore your application scaling to close
to linear.  If you could equip your nodes with a faster network and keep
scaling linearly across this regime, it would be worth spending up to
six nodes out of sixteen to do it: 10 (dual) nodes scaling linearly
would still yield a speedup of about 20, instead of requiring 16 poorly
scaling nodes to get the same 20.  OTOH, managing only 10 nodes is
cheaper and easier, and providing power and AC to 10 nodes is cheaper
than providing it to 16 (estimate: six nodes at ~200W each cost roughly
$1200 US a year in power and AC alone) -- between this and Amdahl's law,
the fewer nodes you buy, the better many things usually are.
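
For what it's worth, here is the back-of-the-envelope arithmetic behind that
~$1200 figure as a trivial snippet; the electricity rate and the AC overhead
factor are my own assumed numbers, not measurements, so plug in yours:

/* Back-of-the-envelope power cost sketch.  The $/kWh rate and the AC
   overhead factor are ASSUMED numbers -- substitute your local figures. */
#include <stdio.h>

int main(void)
{
    double watts_per_node = 200.0;          /* typical 1U Opteron node    */
    double nodes_saved    = 6.0;            /* 16 nodes vs 10 nodes       */
    double hours_per_year = 24.0 * 365.0;
    double usd_per_kwh    = 0.08;           /* ASSUMED electricity rate   */
    double ac_overhead    = 1.5;            /* ASSUMED: AC adds ~50%      */

    double kwh  = nodes_saved * watts_per_node * hours_per_year / 1000.0;
    double cost = kwh * usd_per_kwh * ac_overhead;

    printf("%.0f kWh/year, roughly $%.0f/year for power + AC\n", kwh, cost);
    return 0;
}

With those assumptions it prints about 10500 kWh and ~$1260 a year, i.e. the
same order as the estimate above.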

If you are bottlenecked at the network, consider e.g. Myrinet, SCI,
infiniband.  Most of the high end networks will cost on the order of
$1000-1500 per node (IIRC), which works out just about right for getting
10 nodes with a high end network instead of 16 without one.  Obviously,
if you are using 100BT for any reason you should at LEAST move to
gigabit ethernet, and on Opterons you should probably use both gigabit
interfaces per motherboard (so each CPU has its own network) before
moving on.  Also
investigate the possibilities of RDMA ethernet adapters if your task
involves moving data WHILE you are computing so that data movement over
the network is blocking your application.  One of the major benefits of
the high end networks is that they all tend to enable DMA data transfer
so that network transactions can complete while the CPU cranks on
something else.
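
The programming pattern that exploits this, in MPI terms, is nonblocking
send/receive: post the transfers, compute on data you already own, and only
wait when you actually need the incoming data.  A minimal sketch follows --
the pairing of ranks and the buffer sizes are arbitrary, and whether the
transfer really proceeds in the background depends on the NIC and the MPI
implementation:

/* Toy communication/computation overlap sketch (mpicc).  Whether the
   transfer truly proceeds in the background depends on the NIC and the
   MPI implementation -- this just shows the programming pattern.       */
#include <mpi.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double sendbuf[N], recvbuf[N], work[N];
    int rank, size, partner, i;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    partner = rank ^ 1;                     /* pair up neighbouring ranks */

    if (partner < size) {
        /* post the transfers first ... */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... compute on data you already own while they (maybe) proceed ... */
        for (i = 0; i < N; i++)
            work[i] = work[i] * 0.999 + 1.0;

        /* ... and only block when you actually need the new data. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    if (rank == 0) printf("done\n");
    MPI_Finalize();
    return 0;
}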

If you are memory bound you are STILL likely not to see much benefit
from multicore CPUs, although this depends somewhat on the actual
pattern of memory utilization.  With some effort, some tasks can be
mutually synchronized in such a way that their memory accesses don't
collide.  The memory/bus architecture of the Opterons is worth studying,
BTW -- HyperTransport changes the way things work enough that earlier
assumptions of how memory is bottlenecked (or not) may be incorrect.
The way memory tends to be associated with processors creates the
possibility of a "network-like" bottleneck in accessing memory
associated with the other processor in at least some dual CPU
architectures, made worse (as always) with multicores.  Memory bound
apps could actually run worse on multicore CPUs.
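
One quick way to get a feel for this on a candidate node is a STREAM-style
bandwidth test run with one, two, and four threads: if the aggregate MB/s
barely grows with the thread count, the cores are already fighting over the
memory channels and extra cores won't buy you much.  A minimal OpenMP sketch
(array sizes chosen arbitrarily to be well past cache, not tuned for any
particular Opteron):

/* STREAM-ish triad sketch (gcc -fopenmp).  Run with OMP_NUM_THREADS set
   to 1, 2, 4 and compare the aggregate MB/s.                            */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (16 * 1024 * 1024)    /* ~128 MB per array, well past cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    int i;
    double t0, t1, mbytes;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];           /* classic triad */
    t1 = omp_get_wtime();

    mbytes = 3.0 * N * sizeof(double) / 1.0e6;   /* bytes moved, roughly */
    printf("%d threads: %.0f MB in %.3f s = %.0f MB/s\n",
           omp_get_max_threads(), mbytes, t1 - t0, mbytes / (t1 - t0));

    free(a); free(b); free(c);
    return 0;
}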

The one kind of application that unambiguously runs faster on multicores
is the kind that tends to run twice as fast on dual CPU systems -- CPU
bound tasks.  These are tasks that tend to do a lot of computation per
memory access per network access per disk access (where ideally they
access the network and/or disk only very rarely, at the beginning and
end of a computation, say).  In that case you can put tasks on all the
cores that run in parallel and do not contend for the same task-limiting
resource.

When studying your application, BTW, pay close attention to scale.  Many
parallel applications will scale poorly for "small" computations because
they do relatively little computation per IPC or memory hit, or because
memory hits are too narrow (working through a multidimensional matrix,
perhaps) to take advantage of the pipelines and prefetch capabilities of
the architecture.  Scale those same computations up to production, and
they are suddenly doing BIG vectors, and doing a lot of computation
between network transactions.  Task scale alone can sometimes move you
from a poor parallel scaling regime into one that scales roughly
linearly.  Some of the simple test programs I've played with will scale
NEGATIVELY for small runs (run slower on ten nodes than on one, for
example) but when you do a big run, will return nearly linear speedup.
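
A crude way to see why scale matters: in many grid or matrix codes the local
computation grows like the volume of the per-node domain while the data
exchanged grows like its surface, so the compute-to-communication ratio
improves as the per-node problem gets bigger.  A toy surface-to-volume model,
purely illustrative and not a description of any particular code:

/* Toy surface-to-volume model: per-node work ~ n^3, per-node halo
   traffic ~ 6*n^2.  Purely illustrative -- real codes differ.     */
#include <stdio.h>

int main(void)
{
    int n;
    for (n = 10; n <= 1000; n *= 10) {
        double flops = (double)n * n * n;        /* "computation"   */
        double comm  = 6.0 * n * n;              /* "communication" */
        printf("local grid %4d^3: compute/comm ratio ~ %.0f\n",
               n, flops / comm);
    }
    return 0;
}

The ratio goes from ~2 at n=10 to ~167 at n=1000, which is the sort of shift
that can move a code from negative scaling to nearly linear scaling.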

> 2. Another question: does dual-core technology bring any advantages for the
> efficient usage of the large amount of memory that we will utilize?

More likely disadvantages.  Too many cars on a too-narrow road.  But
this is SO dependent on your task that you should very definitely
continue to study the possibility.  Until you not only know but
thoroughly understand the answers to the questions raised above about
task bottlenecks at various task sizes, it is hard indeed to say, and
any answer you might obtain could disappear and become a DIFFERENT
answer if you modified your application to take advantage of the
architecture's good features and avoid its ungood features.


> 3. Finally there is something basic that I'm not sure about: When we assign
> a job to a dual-core CPU, can it divide the job between the cores
> automatically, or should we think of a dual-core CPU the same as a dual-CPU
> node? If the latter is the case, what is the advantage of this technology
> over dual CPUs?

If you assign "a job" to a dual core CPU, as far as I know you get no
benefit at all unless the task is multithreaded.  So it is more like a
dual CPU node.  A dual CPU, dual core node has basically four CPUs, with
the two cores on each socket sharing that socket's data transport
channels.  So bottlenecks on those channels can become even worse.  I
don't have enough experience with them to know when and where that
becomes an issue -- what has been published on the list suggests, as
always, that it CAN be an issue but often isn't.
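
Put another way: a plain single-threaded process will sit on one core and
leave the other one idle.  You use the second core either by running a second
process on it (the usual MPI approach) or by threading the job itself.  A
minimal OpenMP illustration of the latter, with the loop being nothing but
filler work:

/* Minimal threading illustration (gcc -fopenmp).  Without the pragma
   this loop runs on one core no matter how many the CPU has; with it,
   the iterations are split across however many threads you request.  */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 100000000; i++)
        sum += 1.0 / (1.0 + (double)i);

    printf("used up to %d threads, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}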

   rgb

> 
> If anyone has info and/or experiences about these, I will be very glad to
> know. 
> 
> Thanks in advance,
> Tahir Malas
> Bilkent University 
> Electrical and Electronics Engineering Department
> Phone: +90 312 290 1385 
> 
> 
> 

