[Beowulf] dual-core benefits?

Vincent Diepeveen diep at xs4all.nl
Mon Sep 26 10:22:30 PDT 2005

Hello Tahir,

From your algorithmic description I understand you're doing something
similar to what I'm doing in computer chess. That's also searching trees where
latency is important.

First question: did you already run over a network with it?

In my tree searches I measure the time it takes to fetch 8 bytes from a
random location on a random node, as needed for transposition table lookups.

The latency to get a single message for an inner node over a network is
around 6-10 us (us = microseconds), whereas getting such a message from
memory takes around 111 ns (ns = nanoseconds) on a single-CPU Opteron. It's
about 150 ns when you run a quad (single-core) Opteron.

However, a dual Opteron dual core, which is currently the most price-efficient
node possible, has a latency of 200 ns (for 1GB).

Itanium2 has a latency of 280ns for 8 bytes (when using a small buffer of
just 400MB).

Quad Opteron dual core 1.8GHz with a 2GB buffer shows a latency of roughly
234 ns in my tests.

Please realize that when the number of nodes increases, the chance that you
need to get your information from a remote node INCREASES too.
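
A rough way to see this, as a minimal sketch only: assume the hashtable entries are spread uniformly over the nodes (an assumption, not something measured here). Then a random lookup is remote with probability (N-1)/N. The function names are hypothetical, and the defaults reuse the 200 ns local figure and a midpoint of the 6-10 us network figure quoted above.

```python
def remote_fraction(num_nodes):
    """Chance that a uniformly random lookup lands on another node."""
    return (num_nodes - 1) / num_nodes


def expected_latency_ns(num_nodes, local_ns=200.0, remote_ns=8000.0):
    """Average lookup latency under the uniform-distribution assumption.

    local_ns and remote_ns are illustrative defaults: 200 ns local memory
    and 8 us (midpoint of 6-10 us) for a message over the network.
    """
    p_remote = remote_fraction(num_nodes)
    return (1.0 - p_remote) * local_ns + p_remote * remote_ns


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} nodes: {remote_fraction(n):6.1%} remote, "
              f"~{expected_latency_ns(n):7.0f} ns per lookup")
```

Under these assumptions most of the penalty is already paid at 2 to 4 nodes, since the remote share rises quickly and then flattens out.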

The latency from infiniband is of course worse than from Myri, Quadrics or

Note that there are alternative networks for you other than Myri. Dolphin
might be faster in terms of latency, and Quadrics has a shmem library which
might do exactly what you want.

Of course you need to put in EFFORT then to get it to work.

You should not save on the price of the network.

Please note that on networks there are many ways, which require some hard
programming, to get many nodes to work together.

It took 1.5 years of hard programming, but then I managed to rewrite a very
good SMP solution into a good working 'cluster' solution.

Usually it is possible to rewrite some latency-dependent parts so that they
become bandwidth-dependent instead.
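
One common way to do that, shown here as a minimal sketch and not as the method used in any particular program, is to ship many small requests in one message so the per-message latency is paid once per batch. The model and all default values below are illustrative assumptions: 8 us per message (from the 6-10 us range above), 8-byte entries, and a hypothetical link bandwidth of 250 bytes per microsecond (250 MB/s).

```python
def time_per_lookup_us(batch_size, latency_us=8.0, entry_bytes=8,
                       bandwidth_bytes_per_us=250.0):
    """Average cost of one lookup when batch_size lookups share a message.

    Simple model: one message costs latency_us plus its transfer time.
    All default values are illustrative assumptions, not measurements.
    """
    transfer_us = batch_size * entry_bytes / bandwidth_bytes_per_us
    return (latency_us + transfer_us) / batch_size


if __name__ == "__main__":
    for batch in (1, 10, 100, 1000):
        print(f"batch of {batch:4d}: "
              f"{time_per_lookup_us(batch):.3f} us per lookup")
```

With a batch of 1 the cost is almost entirely latency; with large batches it approaches entry_bytes / bandwidth, which is the bandwidth-bound regime. The price is that the search must be able to continue usefully while a batch is in flight.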

You really want to look carefully at which mainboard you take, as I bet you
want at least onboard PCI-X 133MHz on the mainboard, in order to get better
latency for the network.

Please also consider the option of using quad dual cores as your nodes, with
2 network cards in each quad.

For example what you can consider is just getting 2 network cards and no
switch or whatever.

With 2 quad dual cores you then have 16 cores together, and the only cost is
2 network cards and a cable, or 4 network cards and 4 cables if you put in 4
network cards.

That avoids an expensive switch.

Most cost effective is of course:

1 switch (8 node switch) connected to 8 dual opteron dual cores

I'd pick Quadrics myself for the shmem library they provide, which will
improve your latency big time, as you can avoid the slow MPI overhead,
assuming your packets are fixed size.

No need to check, you can do everything lockless and just keep streaming
packets to nodes, even when they are perhaps not needed in the required
node, as long as they get in the right memory using DMA transfers.

A possible form is to use a token ring idea with respect to the tree search.

It is hard work to rewrite such algorithms this way, but it is possible.

That should work up to about 8 nodes. 16 I doubt.

More switches and routers mean higher latency, don't forget that.

Using 1 switch is the maximum, if you ask me!

For token ring transfers in a small network, you can consider putting in 2
network cards from a manufacturer and directly connecting a cable between

So node A connects to node H and to node B. Node B connects to node A and
node C, node C connects to node B and node D and so on.

So you have 1 network card that is just streaming in 1 direction and you
can use the full bandwidth of it.
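
The wiring described above can be sketched as follows. This is only an illustration: the node names A..H follow the example, the function names are hypothetical, and the hop count assumes packets travel in one direction only.

```python
import string


def ring_neighbours(num_nodes):
    """Map each node name to (previous, next) in a closed ring.

    Nodes are named A, B, C, ... so this sketch supports at most 26 nodes.
    """
    names = string.ascii_uppercase[:num_nodes]
    return {
        name: (names[(i - 1) % num_nodes], names[(i + 1) % num_nodes])
        for i, name in enumerate(names)
    }


def one_way_hops(src, dst, num_nodes):
    """Hops needed when packets only travel in the 'next' direction."""
    names = string.ascii_uppercase[:num_nodes]
    return (names.index(dst) - names.index(src)) % num_nodes


if __name__ == "__main__":
    for node, (prev_node, next_node) in ring_neighbours(8).items():
        print(f"node {node}: receives from {prev_node}, sends to {next_node}")
```

Note that in a one-directional ring a packet from A to H passes through every other node on the way, so the worst-case hop count grows with the ring size.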

The price of this is 2 network cards per node and 1 cable per node. Note that
not all networks can work without a switch, but I'm sure the people in
question can answer that for you.

With the token ring principle you can build nodes cheaply, each requiring just
2 network cards. It probably requires a complete rewrite of your software. The
huge advantage of such token ring principles is that you might be able to use
more network cards. I'm not familiar with Infiniband, but perhaps someone
can post here the bandwidth you can push through it on a dual Opteron node.

Such a rewrite to a token ring algorithm is worth it.

It is NOT easy to run tree searches that depend on low latency for the
hashtables on bad-latency networks (bad in comparison to local memory latency).

Consider buying an 8-processor mainboard with quad cores. That doesn't
require a rewrite. In the UAE you can buy that stuff cheaply. It's practically
untaxed there.


P.S. I would take cheap dual core nodes, as in January 2006 the quad cores
will be there already. Knowing many Turkish people myself, I know they usually
plan long ahead, and if that's the case here, consider getting quad cores.

At 10:33 AM 9/22/2005 +0300, Tahir Malas wrote:
>Hi everyone,
>I would like to take advice for the processor selection for the cluster that
>we will configure soon. Comparing the sequential performance of our programs
>on an Opteron 246 and a much more expensive machine with Itanium Processor,
>we have decided to use opteron processors with Tyan mbs. However, we are in
>a confusion to decide on the processor selection. Before posing my
>questions, I'd better give some info about our application requirements:
>1. The scalability of our program is not so good, less than 20 for 32 nodes
>(measured on a single node system). So we don't plan to go beyond 16 nodes.
>(which makes 32 processors due to dual-node usage)
>2. Memory requirement is huge; we will use 4GB memory per node for the time
>being and increase this to 16 GB later. So we need fast CPUs and efficient
>usage of memory.
>3. Due to budget limitations we will first configure 8-node system with 4GB
>RAM per node and extend this to a 16-node system with 16-GB of RAM in 6
>We were thinking of AMD 250 processors, but now the benchmarks of dual-core
>CPUs (on the web site of AMD) seems encouraging, and the cost of dual-core
>AMD 275 seems to be less than twice that of AMD 250. Since the memory cost of our
>system will dominate other costs, we can afford to pass to dual-core
>technology. However, the questions that arise are follows.
>1. Will it worth? And can we gain any advantages over single-core with the
>not-so-good scalability of our parallel programs? 
>2. Another question is that is dual-core technology brings any advantages
>for the efficient usage of high amount of memory that we will utilize?
>3. Finally there is something basic that I'm not sure: When we assign a job
>to dual-core CPU, can it divide it between the core-CPUs automatically, or
>should we think dual-core CPU the same as dual-node CPU? If the latter is
>the case, what is the advantage of this technology over dual-node?
>If anyone has info and/or experiences about these, I will be very glad to
>Thanks in advance,
>Tahir Malas
>Bilkent University 
>Electrical and Electronics Engineering Department
>Phone: +90 312 290 1385 