availability of Memory compression routine

Robert G. Brown rgb at phy.duke.edu
Wed Jul 17 22:49:35 PDT 2002

On Thu, 18 Jul 2002, Kwan Wing Keung wrote:

> Dear Colleagues,
> Recently I have been working on parallelizing a user's program that
> involves repeated mpi_broadcasts of a big 2D array (around 1000*1000
> complex*16) from the master to each of the compute slaves.  The
> parallelization is now complete, but the speed efficiency is not very high.

Master-slave may be a poor communication paradigm, depending on what
mpi_broadcast actually does at the network level.  If it really uses a
network broadcast (so it puts the data out just one time, at which point
it is picked up by all the slaves at the same time with no significant
bottlenecks at the switching level) then the best you can do is make
your network faster or arrange to do more computation per communication
block to improve your scaling.  However, "broadcasts" in the context of
parallel communications packages may well be hiding serial transmission
to all the hosts one at a time.

If this is so and the word "repeated" above refers to the master sending
the whole block to slave 1, then the whole block again to slave 2, etc.
there are a number of ways to speed things up.  One is to use a true
broadcast to get the data to all slaves at once.  Another is to pass the
data blocks down a tree, so the master sends to 1, then 1 sends to 3
while the master sends to 2, then the master, 1,2,3 send to 4,5,6,7, and
so forth (which takes order of log_2(N) steps to send to N hosts, not N
steps).  A third is to use a faster network -- if using 100 BT, upgrade
to 1000 BT, for example.
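That log_2(N) fan-out can be sketched as a schedule of (sender,
receiver) pairs per step.  The helper below is mine, not from any
library, though real MPI_Bcast implementations often use a similar
binomial tree internally:

```python
def tree_broadcast_schedule(n_hosts):
    """Binomial-tree broadcast from host 0: at each step, every host
    that already holds the data forwards it to one that does not."""
    steps = []
    have = 1  # hosts that already hold the data: 0 .. have-1
    while have < n_hosts:
        pairs = [(s, have + s) for s in range(min(have, n_hosts - have))]
        have += len(pairs)
        steps.append(pairs)
    return steps

# 8 hosts are covered in 3 steps instead of 7 serial sends:
for step, pairs in enumerate(tree_broadcast_schedule(8)):
    print(step, pairs)
```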

I doubt compression will help you much unless your physical network is
horribly slow (e.g. 10 BT, T1, sneakernet).  Data compression is costly
in compute (and real) time.
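To see why, here is a crude break-even model.  Every rate below is
illustrative rather than measured, and note that dense floating-point
matrix data tends to compress poorly, so a 2:1 ratio is already
optimistic:

```python
def compression_pays_off(size_mb, net_mbps, comp_mbps, decomp_mbps, ratio):
    """Compression wins only if compress + ship-smaller-payload +
    decompress beats shipping the raw data.  Rates in MB/s; ratio is
    compressed_size / original_size."""
    raw_t = size_mb / net_mbps
    comp_t = (size_mb / comp_mbps
              + size_mb * ratio / net_mbps
              + size_mb * ratio / decomp_mbps)
    return comp_t < raw_t

# 16 MB matrix, 100BT (~12 MB/s), zlib-class speeds, generous 2:1 ratio:
print(compression_pays_off(16, 12, 10, 30, 0.5))  # compression loses here
```

Only when the network rate drops well below the compression rate (the
10 BT or T1 case above) does the trade start to pay.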

Your basic problem is, of course, that sending 16 MB just one time from
the master to a slave likely takes close to 2 seconds on a 100BT
network.  If the N slave nodes you send to then work for five minutes
before they need another block, your parallel efficiency is quite good.
You do 300*N seconds of work for every 2*N seconds of communication,
in a roughly 302-second cycle per slice (2 seconds to receive plus 300
seconds to compute), for an efficiency near 300/302, or about 99%.
You could get parallel speedup then to the point where
your master was constantly working delivering the next work slice and
just exactly had the next slice ready to send when a node finished the
previous slice.  As soon as a node has to wait idle while the master
finishes sending to another node, though, your speedup saturates if it
hasn't already for other reasons.
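That accounting can be written down as a toy steady-state model (a
sketch under the assumptions above; `cluster_throughput` is a
hypothetical helper, not from any package):

```python
def cluster_throughput(n_slaves, send_s, work_s):
    """Speedup of a master serially feeding n_slaves, in steady state.
    Per batch, the master needs n_slaves * send_s seconds of sending,
    while each slave's own cycle is send_s + work_s seconds; the longer
    of the two limits the wall-clock time per batch."""
    cycle = max(n_slaves * send_s, send_s + work_s)
    return n_slaves * work_s / cycle  # work-seconds done per wall-second

# 2 s sends, 300 s slices: near-perfect scaling until ~150 slaves.
print(round(cluster_throughput(10, 2.0, 300.0), 2))   # ~9.93
print(round(cluster_throughput(200, 2.0, 300.0), 2))  # saturated at 150.0
```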

OTOH, if your slave nodes work for only (say) 8 seconds per slice, then
your master is basically sending all the time, delivering the next
slice of work to each node as soon as that node finishes the previous
one, and it can support only about four slaves.  Those four slaves do
around 4x8 seconds of work (32 work-seconds) in about 16 seconds (8
seconds to send the data to all four nodes at 2 seconds each, plus 8
seconds for the slowest node to finish), for a speedup of only about
2x -- 50% of the serial run time -- as the master could do 16
work-seconds all by itself in the same time.  This sounds so very much
like your situation that I'd guess it agrees quantitatively.
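The arithmetic in that paragraph, made explicit (every number below
comes straight from the estimates in the text, not from measurement):

```python
send_s, work_s = 2.0, 8.0        # ~2 s to ship the matrix, 8 s of work per slice

n_slaves = int(work_s / send_s)  # the master can keep only ~4 slaves fed
send_all = n_slaves * send_s     # 8 s to feed all four nodes in turn
wall = send_all + work_s         # the slowest node finishes 8 s later: 16 s
work_done = n_slaves * work_s    # 32 work-seconds accomplished

print(work_done / wall)          # 2.0 -- twice what the master alone manages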

To improve node scaling, you have to increase the amount of computation
done per unit of communication.  You can do this by doing more work per
transmission of a matrix to a node, or by decreasing the time required
for the transmission as described above.  Sometimes just making the
problem (paradoxically) BIGGER improves this ratio -- if your
communication time scales like N^2 (for N = 1000 above) but the
computation time scales like N^3 or more, then going to N=10000 would
probably let you see a decent speedup for as many as 30-40 nodes.
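A quick model of that surface-to-volume argument (the rates are
placeholders; only the N^2 versus N^3 scaling matters):

```python
def compute_to_comm_ratio(n, net_elems_per_s=1e6, flops=1e8):
    """O(n^2) elements to communicate vs O(n^3) operations to perform
    (typical of dense linear algebra): the ratio grows linearly in n."""
    comm_time = n**2 / net_elems_per_s
    compute_time = n**3 / flops
    return compute_time / comm_time

# Ten times the matrix dimension buys ten times as much computation
# per unit of communication, so ~10x as many nodes stay busy.
print(compute_to_comm_ratio(10000) / compute_to_comm_ratio(1000))  # 10.0
```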

To understand all of this better and see how to compute it (and possibly
repair it) you should look at books on parallel computation.  There are
some curves and algebra that might help in the online book on


(and other references you might find on the brahma site).  You'll
probably have to look at the figures in the postscript or pdf image -- I
still need to fix the online sources so that they show the figures
(latex2html won't inline figures unless they are in a figure
environment, grrr).



> Basically we found that upon using 4-5 processors, the program speeds
> up by around 60% (i.e. runs in 40% of the original serial execution time).
> Further increase in the number of processors will not help.  Though the clocked
> CPU time for each slave goes down, the wallclock duration is nearly flat.
> Likely the network is already "saturated".
> By using the Hermitian property, I can now reduce the communication
> size by half (the upper triangle can be locally generated from the
> elements in lower triangle in each slave after communication).
> The "saturated" time is now reduced by another 45%.
> My question is now whether we have a generic memory compression routine
> that allow the compression of a big memory chunk to a much smaller one
> like that used in "zip" or "compress".  Of course we are talking about
> compression for memory variable inside a standard Fortran program BUT
> NOT the compression in a disk file.
> In this case we can first compress the huge array and then use
> mpi_broadcast to send the compressed data.  Upon receiving the compressed
> data, each slave can decompress it to retrieve the original data.
> In simple words, we are trading extra local computation for less communication.
> Any suggestion is whole heartedly welcome.
> W.K. Kwan
> Computer Centre
> p.s. I prefer the compression/decompression routines in pure F77 coding,
> i.e. with no recursion.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
