Beowulf & FFT

Robert G. Brown rgb at phy.duke.edu
Mon Jul 24 14:05:54 PDT 2000


On Mon, 24 Jul 2000, Martin Siegert wrote:

> The problem is that you have to do a matrix transpose between the two
> loops. FFTW does this using MPI_Alltoall. As Tony pointed out this may not
> be the best choice. So I've been busy writing a matrix transpose routine
> myself - something that is trivial for serial programs is totally nontrivial 
> when the matrix is distributed over several processors. If somebody can 
> give me a hint or reference how this is done for best performance, I'll
> be thankful forever :-) [right now I'm trying MPI_Type_vector to define

I don't know if it will be of use to you, but on the off chance of
having somebody thankful forever (which is a very long time, after
all;-) check out:

http://www-unix.mcs.anl.gov/dbpp/text/node43.html#SECTION02540000000000000000

This is the "Convolution" section of "Designing and Building Parallel
Programs", by Ian Foster, an online book.  It covers 2D FFTs in fair
detail, including a comparison between transposition and pipelining
algorithms.  It suggests that pipelining might be better when N is
"small", as it appears to be for you.
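As an aside, the Alltoall-style transpose you describe boils down to a
fixed data-movement pattern: with the N x N matrix distributed by rows
over P processes, rank i carves its rows into P column blocks, sends
block j to rank j, and then locally transposes each block it receives.
Here is a serial Python sketch of just that pattern (no MPI; the
function name is my own, and it assumes P divides N):

```python
def distributed_transpose(matrix, P):
    """Serial simulation of a transpose done as block exchange + local transpose."""
    N = len(matrix)
    b = N // P  # block edge; assumes P divides N
    # rows[i] = the contiguous rows "owned" by rank i
    rows = [matrix[i * b:(i + 1) * b] for i in range(P)]

    # "Alltoall": send[i][j] is the b x b block rank i sends to rank j
    send = [[[row[j * b:(j + 1) * b] for row in rows[i]] for j in range(P)]
            for i in range(P)]
    recv = [[send[j][i] for j in range(P)] for i in range(P)]  # the exchange

    # Local transpose: rank i assembles its b rows of the transposed matrix
    out = []
    for i in range(P):
        for r in range(b):
            new_row = []
            for j in range(P):
                block = recv[i][j]  # came from rank j, so transpose it in place
                new_row.extend(block[c][r] for c in range(b))
            out.append(new_row)
    return out
```

With real MPI the `recv = ...` line becomes the single MPI_Alltoall on
packed blocks, and the final loop is the local unpack/transpose; the
question Foster addresses is whether that one big exchange beats
pipelining the FFT stages instead.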

http://www-unix.mcs.anl.gov/dbpp/text/node126.html#SECTION04330000000000000000

contains a discussion of transposition algorithms appropriate for
hypercubical architectures, which also might be better for small N, if
you can rearrange your systems into a suitable hypercube.
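For what the hypercube variant amounts to: the all-to-all is done in
log2(P) pairwise exchange steps, where at step d each rank swaps, with
its neighbor across hypercube dimension d, every block whose destination
differs from it in bit d.  A serial simulation of that routing (again no
MPI, names are mine, and P must be a power of two):

```python
def hypercube_alltoall(blocks, P):
    """Route blocks[s][t] from rank s to rank t in log2(P) pairwise exchanges."""
    dims = P.bit_length() - 1  # log2(P) for P a power of two
    # held[i] = list of (src, dst, payload) triples currently sitting on rank i
    held = [[(s, t, blocks[s][t]) for t in range(P)] for s in range(P)]
    for d in range(dims):
        bit = 1 << d
        new_held = [[] for _ in range(P)]
        for i in range(P):
            for (s, t, payload) in held[i]:
                # Cross dimension d iff the destination disagrees with us in bit d
                dest = i ^ bit if (t & bit) != (i & bit) else i
                new_held[dest].append((s, t, payload))
        held = new_held
    return held  # held[t] now contains the P blocks destined for rank t
```

Each rank moves half its data per step, so the total volume is
(P/2)*log2(P) blocks versus P-1 for the direct exchange; the win, as I
read Foster, is fewer and larger messages, which is exactly what you
want when N is small and latency dominates.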

You'll likely want the paper copy of the book if you find this useful,
though I suspect you've already found a lot of what it tells you.  In
any event, it's a very nice book and a useful reference to bookmark if
you are a beowulfer.

> Actually, I can: I have to average over random initial conditions. At least
> in principle - I find that usually 4 samples are already enough, even a single
> sample already gives you the growth exponent quite accurately. That's why
> I hoped that I could run a single run in parallel - then I could go to
> larger system sizes and later times. If it can't be done, I'll do exactly
> as you said.

Good luck -- your problem sounds like one where you can probably find a
regime and a parallel algorithm that give you >>some<< parallel speedup
for a relatively small number of systems, although you may not get
linear scaling.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






