availability of Memory compression routine

Michael Prinkey mprinkey at aeolusresearch.com
Fri Jul 19 07:47:29 PDT 2002

While I wholeheartedly concur with Dr. Brown about the need for good 
fundamental parallel algorithm design, the question does raise an 
interesting point.  For some problems (CFD) which are not fundamentally 
reducable to coarser levels of parallelism, interface conditions 
must often be passed among PEs.  Usually, these interface conditions 
represent a physical field of some type which is continuous or piece-
wise continuous, or at least mostly of the same order of magnitude.
It seems likely that some form of differential encoding or other 
lossless compression scheme should be applicable.

Some time ago, a similar issue arose with data file storage for an 
application.  We had been working on binary data storage.  Gzipping 
the binary data resulted in only slight compression.  Gzipping an 
ascii form of the same data reduced it quiet a bit, in some cases 
smaller than the gzipped binary.  So, the point here is that gzip 
is probably not appropriate.

Is there any more appropriate "profile" or "field" compression schemes 
available that leverage this type of differential encoding?  If not,
it might make any interesting research project for parallel processing 
as well as data storage, and visualization applications.

Mike Prinkey
Aeolus Research, Inc.

At Friday, 19 July 2002, "Robert G. Brown" <rgb at phy.duke.edu> wrote:

>Another suggestion:  From the sound of things, the master is doing very
>little "work" except for farming out the base matrix for each node's
>next unit of work.  You also don't indicate whether the work sequence 
>independent or dependent (does the second matrix sent out depend on the
>results sent back by all the nodes) or synchronous (so that all the
>nodes have to finish at a barrier before the next work slice can be
>begun).  It could be that the base matrix isn't horribly difficult to
>compute and is independent for each slice OR that you have a big 
file on
>disk of base matrices and are reading them in and sending them out one
>at a time.  In either case you should likely reconsider your code
>One secret to happy parallelization is to try to push your code
>structure towards the "ideal" of embarrassingly parallel.  You should
>therefore sit down and think carefully about the architecture of the
>problem and whether it is possible to reformulate it (possibly in an
>"odd" way) so that it is more nearly EP.
>You haven't indicated what the actual physics is of the problem you're
>working on, but if the master is computing the matrices it is sending
>out (perhaps generating them from a distribution and a random number
>seed) and if it takes signficantly less time to compute a new matrix
>than to send it via the net, you can very likely restructure your
>problem so that it is EP, or nearly so.
>For example, if your slaves had onboard the RNG and initial matrix
>computation routines, all the master would need to do is send all the
>slaves a starting RNG seed (once, at the very beginning) and an index
>indicating where a given slave should start generating its next matrix.
>The master would basically do trivial bookkeeping -- send out just
>enough information for the slaves to do ALL the work and then kick back
>and await results.
>A second, even more effective way to make the problem EP exists if you
>have large, independent blocks of parameter space that you are
>interested in exploring.  Simply divvy it up among the nodes and let
>each node run the job as a SERIAL task.  Node 1 runs the job(s) 
you were
>going to do on Monday.  Node 2 does the ones you were going to do on
>Tuesday.  Etc.  This gives nearly perfect parallel speedup unless
>(again) the data required to initialize a day's work is really, really
>big.  This will also work even if a given work sequence for a given day
>has dependencies, so that the second matrix sent out depends on the
>results sent back from all the nodes from the first one. 
>Remember, although PVM and MPI are indeed very cool, if you write code
>that REQUIRES their use you have in some sense already lost the 
>wars, moving from EP or extremely coarse grained parallelism (which
>scales well on nearly any network architecture) into medium to fine
>grain parallel, which requires fairly careful thought and co-engineering
>of task and beowulf to achieve the best possible scaling.  Master/slave
>computations can often be engineered to be EP, and a proper account of
>their parallel speedup must include the serial time required to 
>or read in the matrices you are working on.  
>Just imagine that you have just one computer to do the computations on.
>Now somebody gives you a second, but it isn't on the network at all --
>it has maybe a cd drive you can use to move code onto it and perhaps
>read in a big block of initialization data.  Can you use it
>constructively?  Can you crank up a SERIAL computation on it that does
>computations that are truly independent of those on the first, but 
>mean that you finish all the work required for your publication in half
>the time?  If so, you win -- architect your parallelized task the same
>way, with the convenience of a network instead of a CD drive and you
>should be able to achieve near-linear speedup for LOTS of nodes.
>Even if the matrices are being read from disk, they had to get to disk
>somehow.  If they were sent there from, e.g. an experiment with some
>sort of network copy, it would be a simple matter to arrange for 
them to
>get distributed directly to a scratch space on the nodes (disk is so
>cheap nowadays it is very likely that each node has over 30 GB of
>scratch disk anyway).  This would significantly improve execution time
>-- obviously it is "bad" to send the matrices first to the master,
>then sends them to the slaves (two network copies), instead of sending
>them directly to the slaves (one network copy).  Again, the master's
>role is reduced to bookkeeping and collecting (hopefully much smaller)
>The only way you're likely to get stuck is if you're doing a single,
>monolithic computation with complete dependencies and synchronicity (so
>there is just a single thread of matrices you wish to explore and every
>new matrix is generated from the results computed by all the nodes in
>the previous step,) or if the matrices are e.g. on tapes being sent to
>you from Fermilab or the like, in which case you've got a nasty double
>bottleneck to consider -- the tape interface (slow) and the network
>(also slow).  In the first the best you can do is speed up the network.
>In the second you're up against Amdahl's Law here, generalized to
>account for serial and parallel fractions of work.  Reading from 
tape is
>a serial bottleneck that won't go away.  A high end tape has a data
>transfer rate on the order of 6 MB/sec sustained, plus seek times which
>can easily be seconds or longer.  Clearly the network (at 9-10 MB/sec)
>is no longer a bottleneck, so going to gigabit ethernet would not help
>In any event, you definitely need to study your problem carefully,
>especially the bottlenecks, to see how to redesign it to run efficiently
>on large numbers of nodes.
>   rgb
>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit http:

More information about the Beowulf mailing list