availability of Memory compression routine
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Robert G. Brown rgb at phy.duke.eduWed Jul 17 22:49:35 PDT 2002
- Previous message: availability of Memory compression routine
- Next message: E7500 Chipset
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, 18 Jul 2002, Kwan Wing Keung wrote: > Dear Colleagues, > > Recently I have been working in parallelization a user's program that > involved repeated mpi_broadcast of a big 2D-array (around 1000*1000 > complex *16) from the master to each compute slaves. The parallelization > is now completed, but the speed efficiency is not very high. Master-slave may be a poor communication paradigm, depending on what mpi_broadcast actually does at the network level. If it really uses a network broadcast (so it puts the data out just one time, at which point it is picked up by all the slaves at the same time with no signficant bottlenecks at the switching level) then the best you can do is make your network faster or arrange to do more computation per communication block to improve your scaling. However, "broadcasts" in the context of parallel communications packages may well be hiding serial transmission to all the hosts one at a time. If this is so and the word "repeated" above refers to the master sending the whole block to slave 1, then the whole block again to slave 2, etc. there are a number of ways to speed things up. One is to use a true broadcast to get the data to all slaves at once. Another is to pass the data blocks down a tree, so the master sends to 1, then 1 sends to 3 while the master sends to 2, then the master, 1,2,3 send to 4,5,6,7, and so forth (which takes order of log_2(N) steps to send to N hosts, not N steps). A third is to use a faster network -- if using 100 BT, upgrade to 1000 BT, for example. I doubt compression will help you much unless your physical network is horribly slow (e.g. 10 BT, T1, sneakernet). Data compression is costly in compute (and real) time. Your basic problem is, of course, that sending 16 MB just one time from the master to a slave likely takes close to 2 seconds on a 100BT network. If the N slave nodes you send to then work for five minutes before they need another block, your parallel efficiency is quite good. You do 300*N seconds of work per 2*N seconds of communication every 302*N seconds. You could get parallel speedup then to the point where your master was constantly working delivering the next work slice and just exactly had the next slice ready to send when a node finished the previous slice. As soon as a node has to wait idle while the master finishes sending to another node, though, your speedup saturates if it hasn't already for other reasons. OTOH, if your slave nodes work for only (say) 8 seconds per slice, then when your master is basically sending all the time and is sending the next slice of work to a node as soon as it finishes the previous slice, it is supporting only about four slaves. Those four slaves do around 4x8 seconds of work (32 work-seconds) in about 16 seconds (8 seconds to send the data to each node, plus 8 seconds for the slowest node to finish for a speedup of about 50%, as the master could do 16 work-seconds all by itself in the same time. This sounds so very much like your situation that I'd guess that it agrees quantitatively. To improve node scaling, you have to increase the amount of computation done per unit of communication. You can do this by doing more work per transmission of a matrix to a node, or by decreasing the time required for the transmission as described above. Sometimes just making the problem (paradoxically) BIGGER improves this ratio -- if your communication time scales like N^2 (for N = 1000 above) but the computation time scales like N^3 or more, than going to N=10000 would probably let you see a decent speedup for as many as 30-40 nodes. To understand all of this better and see how to compute it (and possibly repair it) you should look at books on parallel computation. There are some curves and algebra that might help in the online book on http://www.phy.duke.edu/brahma/ (and other references you might find on the brahma site). You'll probably have to look at the figures in the postscript or pdf image -- I still need to fix the online sources so that they show the figures (latex2html won't inline figures unless they are in a figure environment, grrr). HTH rgb > > Basically we found that upon using 4-5 processors, the program can speed > up to around 60% (i.e. 40% of the original serial execution time). > Further increase in no. of processor will not help. Though the clocked > CPU time for each slave goes down, the wallclock duration is nearly flat. > Likely the network is already "saturated". > > By using the Hermitian property, I can now reduce the communication > size by half (the upper triangle can be locally generated from the > elements in lower triangle in each slave after communication). > The "saturated" time is now reduced for another 45%. > > My question is now whether we have a generic memory compression routine > that allow the compression of a big memory chunk to a much smaller one > like that used in "zip" or "compress". Of course we are talking about > compression for memory variable inside a standard Fortran program BUT > NOT the compression in a disk file. > > In this case we can first compress the huge array and then use > mpi_broadcast to send the compressed data. Upon receiving the compressed > data, each slave can decompress it to retrieve the original data. > In simple word, we are sacrifying local computation vs communication. > > Any suggestion is whole heartedly welcome. > > W.K. Kwan > Computer Centre > HKU > > p.s. I prefer the compression/decompression routines in pure F77 coding, > i.e. with no recursion. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
- Previous message: availability of Memory compression routine
- Next message: E7500 Chipset
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
