[Beowulf] dual-core benefits?
tmalas at ee.bilkent.edu.tr
Fri Sep 23 05:34:27 PDT 2005
> -----Original Message-----
> From: Robert G. Brown [mailto:rgb at phy.duke.edu]
> Sent: Thursday, September 22, 2005 8:53 PM
> To: Tahir Malas
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] dual-core benefits?
> The first thing you have to do is identify WHY the scaling of your code
> isn't so good -- 20 for 32 nodes.
Well, the answer is pretty simple: we have a highly sequential program.
Consider a tree structure in which the total message size is fixed, but at
the leaf level all leaves communicate with each other, and as we go to
lower levels the number of messages decreases whereas the message sizes
> It's good that you've run benchmarks,
> but you also have to do some deeper probing on the basis of those
> benchmarks to optimize your cluster engineering. You're also being wise
> doing first 8 nodes and then 16, independent of budget. Flaws in your
> design that reveal themselves can be fixed, maybe in the second round
> This is one of those questions that cannot be answered from the data
> given, and maybe not from data you have in hand, but it can be answered.
> First -- what is bottlenecking the parallel process? Memory access
> speed? Network IPCs? Local computation? A combination of the three?
> In particular you are interested in what is causing the fall-off from
> linear scaling -- as you run the job on more and more nodes, those nodes
> are spending more time communicating (for example) per unit of
> If you are bottlenecked at the network, adding more processing cores
> (and trying to use them) can actually SLOW DOWN your computation --
> effectively taking you to 64 nodes and doubling the burden on your
> already overloaded node network.
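
A rough way to put a number on "20 for 32" is to invert Amdahl's law and ask what serial fraction would produce that speedup. The sketch below assumes the simple form S = 1 / (f + (1 - f) / N), which ignores communication cost entirely, so the projection for 64 is an optimistic ceiling rather than a prediction. The function names are illustrative only.

```python
def serial_fraction(speedup, n_procs):
    """Invert Amdahl's law S = 1 / (f + (1 - f) / N) to get the serial fraction f."""
    return (n_procs / speedup - 1.0) / (n_procs - 1.0)


def amdahl_speedup(f, n_procs):
    """Speedup predicted by Amdahl's law for serial fraction f on n_procs processors."""
    return 1.0 / (f + (1.0 - f) / n_procs)


# Measured figure from the thread: a speedup of about 20 on 32 processors.
f = serial_fraction(20.0, 32)
print(f"effective serial fraction: {f:.4f}")

# Best case if the same fraction held when doubling to 64 processors.
# Any extra network contention would push the real number below this.
print(f"Amdahl ceiling at 64 processors: {amdahl_speedup(f, 64):.1f}")
```

With these inputs the effective serial fraction comes out near 2 percent, and doubling the processor count buys well under a doubling of speed even before the network is considered.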
> If you are bottlenecked at the network, you should also look hard at
> your expenditure pattern. I'm assuming that you're using gigabit
> ethernet, as the cheapest mass-market network with decent bandwidth
> available for this range of nodes. However, there are much faster and
> more efficient networks available. Some of them are expensive enough
> that they will "cost you nodes" -- you'll have to get fewer nodes and a
> better network -- but they may restore your application scaling to close
> to linear. If you could equip your nodes with a faster network and keep
> scaling linearly across this regime, it would be worth it to spend up to
> six nodes to do it, for example, out of sixteen, as 10 (dual) nodes
> would still yield a performance of about 20, instead of requiring 16
> nodes to get the same 20. OTOH, managing only 10 nodes is cheaper and
> easier, providing power and AC to only 10 nodes is cheaper than it is to
> provide it to 16 (estimate -- six nodes at ~200W each cost about $1200
> US a year for power and AC alone) -- with Amdahl's law, usually
> the fewer nodes you buy the better many things are.
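
The arithmetic in the paragraph above can be checked in a few lines. This is a sketch under stated assumptions: the rate of $1 per watt per year is not given in the message, it is simply what the quoted figures imply (six nodes at 200 W costing $1200 a year), and "linear scaling" is taken to mean one unit of performance per CPU on dual-CPU nodes.

```python
HOURS_PER_YEAR = 24 * 365


def linear_performance(n_nodes, cpus_per_node=2):
    """Performance in 'single CPU' units if scaling were perfectly linear."""
    return n_nodes * cpus_per_node


def annual_power_cost(n_nodes, watts_per_node=200.0, dollars_per_watt_year=1.0):
    """Yearly power + AC cost. The $1/W/year rate is an assumption implied by the quoted numbers."""
    return n_nodes * watts_per_node * dollars_per_watt_year


# 10 dual nodes on a network that scales linearly match the 20 measured on 16 nodes.
print("10 dual nodes, linear scaling:", linear_performance(10))

# Running six fewer nodes saves this much per year under the assumed rate.
print("annual saving for 6 fewer nodes: $", annual_power_cost(6))

# The assumed rate expressed per kWh, for comparison with a local electricity price.
print("implied all-in rate: $ per kWh =", round(1000.0 / HOURS_PER_YEAR, 3))
```

Substituting a local electricity price and a cooling overhead for the assumed rate changes the dollar figure but not the shape of the trade.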
Well, here comes the memory issue. We actually solve dense systems using
fast algorithms, and need a lot of memory. Being limited to 16 GB per
motherboard, we may need more nodes.
> If you are bottlenecked at the network, consider e.g. Myrinet, SCI,
> infiniband. Most of the high end networks will cost on the order of
> $1000-1500 per node (IIRC), which works out just about right for getting
> 10 nodes with a high end network instead of 16 without. Obviously, if you are
> using 100BT networks for any reason you should at LEAST use gigabit
> ethernet and on Opterons should probably use both gigabit interfaces per
> motherboard so each CPU has its own network before moving on.
Is this really the case? If I use both interfaces, can I safely assume that
each CPU uses a different interface with no congestion on the motherboard? (for
> investigate the possibilities of RDMA ethernet adapters if your task
> involves moving data WHILE you are computing so that data movement over
> the network is blocking your application. One of the major benefits of
> the high end networks is that they all tend to enable DMA data transfer
> so that network transactions can complete while the CPU cranks on
> something else.
> If you are memory bound you are STILL likely not to see much benefit
> from multicore CPUs, although this depends somewhat on the actual
> pattern of memory utilization. With some effort, some tasks can be
> mutually synchronized in such a way that their memory accesses don't
> collide. The memory/bus architecture of the Opterons is worth studying,
> BTW -- HyperTransport changes the way things work enough that earlier
> assumptions of how memory is bottlenecked (or not) may be incorrect.
> The way memory tends to be associated with processors creates the
> possibility of a "network-like" bottleneck in accessing memory
> associated with the other processor in at least some dual CPU
> architectures, made worse (as always) with multicores. Memory bound
> apps could actually run worse on multicore CPUs.
> The one kind of application that unambiguously runs faster on multicores
> is the kind that tends to run twice as fast on dual CPU systems -- CPU
> bound tasks. These are tasks that tend to do a lot of computation per
> memory access per network access per disk access (where ideally they
> access the network and/or disk only very rarely, at the beginning and
> end of a computation, say). In that case you can put tasks on all the
> cores that run in parallel and do not contend for the same
> task-limiting resource.
> When studying your application, BTW, pay close attention to scale. Many
> parallel applications will scale poorly for "small" computations because
> they do relatively little computation per IPC or memory hit, or because
> memory hits are too narrow (working through a multidimensional matrix,
> perhaps) to take advantage of the pipelines and prefetch capabilities of
> the architecture. Scale those same computations up to production, and
> they are suddenly doing BIG vectors, and doing a lot of computation
> between network transactions. Task scale alone can sometimes move you
> from a poor parallel scaling regime into one that scales roughly
> linearly. Some of the simple test programs I've played with will scale
> NEGATIVELY for small runs (run slower on ten nodes than on one, for
> example) but when you do a big run, will return nearly linear speedup.
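
The effect of task scale described above can be illustrated with a toy timing model. This is only a sketch: it assumes run time is compute work divided evenly over the nodes plus a communication cost that grows with the number of nodes, and both the model and the numbers are hypothetical, not measurements from any real code.

```python
def run_time(work, n_nodes, comm_cost_per_peer=0.2):
    """Toy model: evenly divided compute time plus communication that grows with node count."""
    compute = work / n_nodes
    communicate = comm_cost_per_peer * (n_nodes - 1)
    return compute + communicate


def speedup(work, n_nodes, comm_cost_per_peer=0.2):
    """Speedup relative to a single node under the same toy model."""
    return run_time(work, 1, comm_cost_per_peer) / run_time(work, n_nodes, comm_cost_per_peer)


# Small test run: communication dominates, so ten nodes are slower than one.
print("small run, 10 nodes:", round(speedup(work=1.0, n_nodes=10), 2))

# Production-sized run: the same communication cost is now negligible.
print("large run, 10 nodes:", round(speedup(work=1000.0, n_nodes=10), 2))
```

In this model the small run has a speedup below 1 on ten nodes while the large run comes close to 10, which is why benchmarks should be taken at production problem sizes.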
> > 2. Another question is whether dual-core technology brings any
> > for the efficient usage of high amount of memory that we will utilize?
> More likely disadvantages. Too many cars on a too narrow road. But
> this is SO dependent on your task that you should very definitely continue
> to study the possibility. Until you not only know but thoroughly
> understand the answers to the questions about task bottlenecks at
> various task sizes raised above, it is hard indeed to say, and any
> answer you might obtain could disappear and become a DIFFERENT answer if
> you modified your application to take advantage of the architecture's
> good features and avoid its bad ones.
> > 3. Finally there is something basic that I'm not sure: When we assign a
> > to dual-core CPU, can it divide it between the core-CPUs automatically,
> > should we think dual-core CPU the same as dual-node CPU? If the latter
> > the case, what is the advantage of this technology over dual-node?
> If you assign "a job" to a dual core CPU, as far as I know you get no
> benefit at all unless the task is multithreaded. So it is more like a
> dual CPU node. A dual CPU, dual core node has basically 4 cpus, with
> two cores sharing each CPU's data transport channels. So bottlenecks on
> those channels can become even worse. I don't have enough experience
> with them to know when and where that becomes an issue -- what has been
> published on list suggests as always that it CAN be an issue but often
> > If anyone has info and/or experiences about these, I will be very glad
> > know.
> > Thanks in advance,
> > Tahir Malas
> > Bilkent University
> > Electrical and Electronics Engineering Department
> > Phone: +90 312 290 1385