[Beowulf] dual-core benefits?

Fri Sep 23 07:53:01 PDT 2005

Tahir Malas writes:

>> -----Original Message-----
>> From: Robert G. Brown [mailto:rgb at phy.duke.edu]
>> Sent: Thursday, September 22, 2005 8:53 PM
>> To: Tahir Malas
>> Cc: beowulf at beowulf.org
>> Subject: Re: [Beowulf] dual-core benefits?
>> 
>> 
>> The first thing you have to do is identify WHY the scaling of your code
>> isn't so good -- 20 for 32 nodes.  
> 
> Well, the answer is pretty simple; we have a highly sequential program.
> Consider an tree structure in which the total message size is fixed but at
> the leaves side all leaves communicate with each other, and as we go to
> lower levels the number of messages decrease where as the message sizes
> increase.

I'm TRYING to consider this, but you say: 

  "highly sequential program" vs "all leaves communicate with each
other"

  "total message size is fixed" vs "number of messages decrease as
message sizes increase"

and on top of that I usually think of trees as being LR or top-down
structured:

       <
      /
     <
    / \
 --<   <... etc
    \
     <

but your description (all leaves communicate and number of messages
decreases with depth) suggest that your tree runs from the leaves down
to the trunk.

So you might have to be a bit more specific:-)

>From your description, though, it sounds like you have a highly PARALLEL
program with lots of interprocessor communications of the worst possible
pattern for good scaling -- all nodes talk to all nodes at many stages
of the processing pipeline.

> Well, here comes the memory issue. We actually solve dense systems using
> fast algorithms, and need a lot of memory. Being limited by the 16GB per mb,
> we may need more nodes. 

>> If you are bottlenecked at the network, consider e.g. Myrinet, SCI,
>> infiniband.  Most of the high end networks will cost order of $1000-1500
>> per node (IIRC) which works out just about right on the get 10 nodes
>> with the high end networks instead of 16 without.  Obviously, if you are
>> using 100BT networks for any reason you should at LEAST use gigabit
>> ethernet and on Opterons should probably use both gigabit interfaces per
>> motherboard so each CPU has its own network before moving on.  
> Is this really the case? If I use both interfaces, can I safely assume that
> each CPUs use different interfaces with no congestion in the mb? (for
> dual-node single-core)

I don't believe you can assume it, no, but you can probably program it
that way with some attention paid to routing and communication patterns.
Or you can perhaps channel bond the interfaces and increase the width of
the pipe to accommodate the IPCs of the two onboard processors.
However, there are still details of the communications pattern that are
important.  For example your description above of all-to-all
communications still doesn't make it clear whether you are latency or
bandwidth dominated or -- gak -- both.  If there is a stage where you
are sending lots of little fixed size messages between all the nodes you
might well be latency dominated, and then transition to bandwidth
dominated as the message sizes increase and the number of messages
decrease (and the number of nodes participating decreases?).

I'm also curious as to why and how you parallelize.  You have a single
node rate that you're using as your task baseline -- that's how you get
your 20x single node speed for 32 nodes.  Are you parallelizing to be
able to do larger problems than will fit on a single node?  Or to reduce
the time it takes to complete a single problem?  If the latter, is your
work pattern such that you can get the same net benefit by running 32
instances of the single node task on the 32 nodes, let them each take
the single node completion time, but get 32 work units done in that
time?

I mean, people on this list might make fun of embarrassingly parallel
programs, but One True Goal of the parallel programmer is to find work
patterns that are as EP as possible, as EP programs get nearly linearly
scaling across the broadest possible range of hardware configurations.
So if you have an endless supply of work units you need to do and the
per-task latency (how fast you complete any one of them) is less
important to you than the total-task bandwidth (how much work you do per
unit time) consider reorganizing your task as N single node tasks.

The payoff for doing so is that suddenly dual core cpus may make sense,
you (probably) don't have to mess with both ethernet controllers or
consider an expensive network (unless there are e.g. data storage issues
that still make the network or a server a bottleneck), and you can scale
your cluster to VERY large numbers of nodes and still get essentially
linear speedup (limited typically by task setup and takedown time -- how
long it takes to start the task and retrieve the results compared to its
runtime).

If you cannot make your task EP -- maybe you're building something like
a lookup engine or search engine (what needs a tree and this sort of
communications pattern, hmmm:-) or graphical processing engine that
takes tasks one at a time and has to process them fast enough that a GUI
user doesn't get bored and go home for the day or maybe you are doing
work on some sort of spatial lattice and are using the cluster to
partition the lattice so that you can manage larger lattices than a
single node will hold -- then it sounds like you need to perhaps use Ian
Foster's online book on parallel programming or similar resources to
work out optimal ways of managing e.g. the task partitioning and
communications patterns.  You should probably also look at e.g. MPICH2,
which has functions that already use pretty highly optimized
communications patterns for the kind of worst-case all-to-all message
passing.

This is important because there are significant scaling issues
associated with communications.  In the worst of worst cases, each node
has to send all other nodes a message for that node only (and of course
receive a message from that node in turn).  The amount of time this
takes can differ by factors of order N*T (T the fixed message time)
depending on the communications pattern used, where naive patterns are
likely to significantly and negatively impact the parallel scaling of an
IPC bound computation.  

Ditto the communications patterns associated with e.g. task
partitioning.  In many lattice partitionings, computation time (per
node) scales like volume where communication time scales like the area
(of the lattice partition per node).  Parallel efficiency often depends
strongly on the ratio of computation time to communication time (using,
one hopes, an optimal communication pattern as mentioned before).  I
bring this up because even if you have long-range communications
(implied somewhat by all-to-all or each-to-all) there are usually
similar scaling considerations in task partitioning.

THIS in turn affects your benchmarks.  If you benchmarked by taking a
task at some fixed "size", call it L, and partitioning it into L/N-sized
pieces, expecting them to complete in T_0/N time where T_0(L) is the single
node time for the task at size L, and each node has to communicate with
all other nodes for a total communications time (optimized) of (say)
(N-1)*T_m(L) where T_m(L) is the time required to send all requisite messages
in a pattern that causes minimal task blocking, then:

  T_tot(L,N) = T_0(L)  +  (N-1)*T_m(L) = T_0(L) ( 1  +  (N-1)*T_m(L))
                ---                               -           ------
                 N                                N           T_0(L)

is the estimated time to completion and 

R(L,N) = 1/(1/N + (N-1)T_m/T_0)

is the RELATIVE speedup of the computation (compared to 1 for 1 node).
(There are usually other serialized contributions for startup and
shutdown, for example, but I'm assuming that these are negligible
compared to T_0/N at all scales N you are likely to reach, which may not
be true if the task only takes order of seconds on a single node).

Linear speedup occurs where T_m/T_0 is "small", specifically, where 1 >>
N^2(T_m/T_0).

Usually BOTH T_0 AND T_m are functions of L.  The trick is that they can
be DIFFERENT functions of L.  In the nearest-neighbor lattice
partitioning case:

   T_0 \propto L^d (volume, where d is the lattice dimension)

and

   T_m \propto L^(d-1)  (surface).

Increasing L then actually shifts the ratio in a beneficial way: T_m/T_0
\propto 1/L.  So does decreasing the BASIS of T_m (e.g. using a faster
network!).  So does using DMA, in the event that your task organization
is such that it doesn't have "barriers" but can continue computing even
while communicating (in parallel) as it is only the serial blocking
time required for a network communication that contributes to T_m.

Anyway, this is the reason I was asking for details of your
communications pattern, because trying to figure out what CPU you should
use and what network (and what your relative expenditure should be on
nodes vs network vs memory vs degree of multiprocessing) you should use
in a QUANTITATIVE way involves analysis LIKE that which I outline here,
but of course modified for the specific characteristics of your
application.  

Very similar analysis contributes to what happens to T_0 (and T_m!) as
you go from UP systems with a single NIC per processor and dedicated
memory (where now you have to think in terms of MEMORY bottlenecks and
how long the CPU effectively sits idle waiting for data from memory in
different computational patterns) to dual SMP systems (where now there
can be a "network like" hit for simultaneous access to memory where one
task is blocked by the other and has to wait idle, increasing average
T_0 relative to a UP system and reducing efficiency) to dual dual core
SMP systems where the problem may be further exacerbated -- or not,
depending on memory access pattern and the ratio T_C/T_M (computation to
memory access times).

Here you also have to consider the effect of increasing degrees of MP on
T_m.  Basically, T_m algebraically splits.  T_ml (local) is the time
required to send your message(s) to the other CPUs on the same system,
and is typically T_ml << T_mn (network), the time required to send the
SAME message on the network.  However you've also REDUCED the network
parallelism available to the SMP cpus.  They must share a fixed
bandwidth, fixed latency channel and so for P processors on one network
channel it is LIKELY that T_mn \approx T_m*P for a single CPU on a
single network channel.  So your network communications time goes UP,
while your communications time between processes on the single node go
DOWN (often so far down you can safely consider them to take "zero time"
compared to the network as far as scaling estimates are concerned).

This is what I was referring to in my previous communication to Michael
Will when he was proposing highly MP systems as maybe being good for
your task.  The problem is that it is maybe.  Suppose P is the number of
processors per box in an N processor cluster.  Lets re-estimate the time
of completion assuming T_0 doesn't significantly vary (ignore memory
contention etc) and that the P processors must share a single
communications channel.

If P = 1, we have the original result.  If P != 1, we have
half the number of boxes but each box has two processors (so N remains
the same).  This is a null result as far as comparative processing time
is concerned -- we still scale like 1/N here.  To be concrete, for your
dual processor, N = 32 cluster, you get a 1/32 reduction in time from
splitting up the task.

Each PROCESSOR has to send N - 1 messages total in a non-colliding
pattern (that can complete in parallel).  P-1 of them are on the same
box and take "zero" time.  The remaining N - (P - 1) - 1 = N - P of them
take P*T_m time.  Again to be concrete, each processor has to talk to
one processor on the same system very very quickly and 30 processors
over the network, where the network is twice as slow.  In general, our
time of completion scales something like:

  T_tot(L,N,P) = T_0(L) ( 1  +  (N - P)*P*T_m(L))
                          -               ------
                          N               T_0(L)

This is a pretty messy function in three dimensions.  N = P is obviously
one desireable limit -- it REALLY shifts you back to the original form,
but with an internal T_m/T_0 that is typically orders of magnitude
smaller than ethernet in an e.g. shared memory architecture -- for small
N it pretty much restores linear scaling without question if you get
POSITIVE speedup over ethernet.

However, it has potentially BAD properties for small P.  For small P
(say P = 2 and N = 32) your communications time is 30*2*(T_m/T_0) which
is 15/8 times LONGER than it was for 32 processors each with its own
independent communications channel.  As long as N >> P this factor will
be roughly P times larger than it would be in a UP cluster.  Eventually
there is a break-even point, of course, and you can even compute it (but
I'm too lazy to).  Given the COST scaling of highly SMP systems, though,
it is relatively unlikely that you'll even break even with N UP systems.

This is (very precisely) why I suggested one way or another using both
ethernet channels on your motherboards if you use a dual processor
system on a network bound task.  The important thing is the bandwidth
PER PROCESSOR in the analysis above -- the factor of two goes away if
you use two ethernet channels on a dual processor motherboard.  A factor
of two is significant here -- probably order of half the distance to the
goal at these scales even if you don't can't make ANY other adjustments
in your task organization or basic cluster design, so your 32 nodes
might speed up the computation by 26 instead of 20.  26/32 is starting
to be pretty respectable -- over 80% efficiency.  Ensuring a
non-colliding each-to-all communication pattern might well give you
another factor of two and move you to maybe a speedup of 28.  To do
better, look into the OTHER things above, or as I suggested, change
networks.  A high end network and suitable communication pattern might
push your computation up into the high .9's -- essentially linear
scaling across this range of N and P.

    rgb
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20050923/279bbe89/attachment.sig>