Beowulf & FFT

Martin Siegert siegert at sfu.ca
Tue Jul 18 11:22:52 PDT 2000


Hi there,

this is probably related to the problems that have been discussed
recently under the topic "Beowulf & Fluid Mechanics". I've been
trying to optimize FFTs (fast Fourier transforms) on a cluster of
dual PII-400MHz PCs with switched fast ethernet. The results are not
promising: the execution times for 100 forward and backward transforms
of a 400x400 system are 26.61 s on a single processor (measured with
gettimeofday), and for np processors I get (in seconds, measured with
MPI_Wtime; a rough timing sketch follows the table):

np    mpich-1.1.2      mpich-1.2.0     mpipro
2       28.54            24.18          27.28
4       33.00            29.29          24.50
8       16.80            16.69          16.41
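
For reference, the single-processor number comes from essentially the
following loop (a minimal sketch against the FFTW 2.x serial interface,
not my actual code; the parallel runs time the same kind of loop around
the FFTW MPI transforms with MPI_Wtime):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <fftw.h>        /* FFTW 2.x serial interface */

    #define N     400
    #define NLOOP 100

    int main(void)
    {
        fftw_complex *data = malloc(N * N * sizeof(fftw_complex));
        fftwnd_plan fwd, bwd;
        struct timeval t0, t1;
        int i;

        /* in-place 2d plans, one forward, one backward */
        fwd = fftw2d_create_plan(N, N, FFTW_FORWARD,
                                 FFTW_MEASURE | FFTW_IN_PLACE);
        bwd = fftw2d_create_plan(N, N, FFTW_BACKWARD,
                                 FFTW_MEASURE | FFTW_IN_PLACE);

        for (i = 0; i < N * N; i++) {   /* some initial data */
            data[i].re = 1.0;
            data[i].im = 0.0;
        }

        gettimeofday(&t0, NULL);
        for (i = 0; i < NLOOP; i++) {   /* 100 forward+backward transforms */
            fftwnd_one(fwd, data, NULL);
            fftwnd_one(bwd, data, NULL);
        }
        gettimeofday(&t1, NULL);

        printf("%g s\n", (t1.tv_sec - t0.tv_sec)
                         + 1e-6 * (t1.tv_usec - t0.tv_usec));

        fftwnd_destroy_plan(fwd);
        fftwnd_destroy_plan(bwd);
        free(data);
        return 0;
    }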

mpich-1.1.2 is compiled for device=ch_p4, whereas mpich-1.2.0 is compiled
for ch_p4 with -comm=shared.
The FFT routines are from the FFTW library (www.fftw.org). The fftw_mpi_test
of the fftw distribution shows similar results. My understanding is that
most of the interprocess communication is due to an MPI_Alltoall in the
matrix transpose routine. None of the MPI distributions seems to handle
this particularly well :-(.
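
To make the communication pattern concrete: transposing a row-distributed
N x N array amounts to an all-to-all exchange of (N/P) x (N/P) blocks plus
local packing/unpacking, something like the following (illustration only,
not FFTW's actual transpose; assumes N divisible by the number of
processes P):

    #include <mpi.h>

    /* Transpose an N x N array of doubles distributed by rows:
     * each of the P processes owns nloc = N/P consecutive rows.
     * Illustration of the communication pattern only.
     */
    void transpose(double *local, double *out, double *sendbuf,
                   double *recvbuf, int N, MPI_Comm comm)
    {
        int P, p, q, i, j, nloc;

        MPI_Comm_size(comm, &P);
        MPI_Comm_rank(comm, &p);
        nloc = N / P;                   /* rows per process */

        /* pack the nloc x nloc block destined for process q contiguously */
        for (q = 0; q < P; q++)
            for (i = 0; i < nloc; i++)
                for (j = 0; j < nloc; j++)
                    sendbuf[(q * nloc + i) * nloc + j] =
                        local[i * N + q * nloc + j];

        /* every process exchanges one block with every other process */
        MPI_Alltoall(sendbuf, nloc * nloc, MPI_DOUBLE,
                     recvbuf, nloc * nloc, MPI_DOUBLE, comm);

        /* unpack: the block from process q holds global rows q*nloc...,
         * which become columns q*nloc... of the local transposed rows */
        for (q = 0; q < P; q++)
            for (i = 0; i < nloc; i++)
                for (j = 0; j < nloc; j++)
                    out[j * N + q * nloc + i] =
                        recvbuf[(q * nloc + i) * nloc + j];
    }

Each process sends and receives roughly N*N/P words per transpose, so
with complex data and both a forward and a backward transform per
iteration that is a lot of traffic for switched fast ethernet; presumably
that is why the all-to-all dominates.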

For mpich, 4 processors are slower than 2 processors. It seems that
communication between processes on the same node is so much faster
(with np=2 both processes run on one dual node, whereas np=4 spans two
nodes) that np=2 turns out to be faster than np=4. However, the effect
is the same in mpich-1.1.2 and mpich-1.2.0, even though only the latter
does shared-memory communication. mpipro doesn't show the same effect.

Since I'm in the process of expanding the Beowulf, I'm wondering whether
switching to a 133 MHz (memory/front-side) bus would improve the results
(given that I can find a motherboard that supports ECC - we had that
discussion).

Any thoughts and comments on this are appreciated.

Somewhat off topic (those of you not interested in physics/stat. mech.
may stop reading here) :-)

I need fast FFTs to integrate PDEs using spectral methods. Those PDEs describe
ordering dynamics of surfaces, pattern formation, etc. (similar to spinodal
decomposition). The usual way of doing this kind of thing is to use finite
differences. That's OK if anisotropies aren't important. If they are, however,
finite differences are tricky, because they introduce artificial anisotropies.
Spectral methods avoid this. However, if spectral methods turn out to be
orders of magnitude slower than finite differences, I'm not sure what to do.
For a single processor this isn't a problem, but when MPI comes into play
the efficiency of all these algorithms must be reevaluated. We probably all
learned something like "the Euler method is bad, implicit methods are
superior". On a Beowulf that isn't true anymore ...
If you have ideas on how to get out of this dilemma, please let me know.
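
To be concrete about what one time step looks like: I have in mind a
semi-implicit update in Fourier space, with the stiff linear part treated
implicitly and the nonlinearity explicitly. A minimal sketch (the
Cahn-Hilliard-like linear operator and the names uhat/nhat/lin are just
placeholders; nhat would come from transforming the real-space
nonlinearity with additional FFTs each step):

    #include <fftw.h>   /* for fftw_complex (FFTW 2.x) */

    /* One semi-implicit time step in Fourier space:
     *   u_k(t+dt) = ( u_k(t) + dt * N_k(t) ) / ( 1 + dt * L(k) )
     * uhat: Fourier coefficients of the field
     * nhat: Fourier coefficients of the nonlinear term (obtained by
     *       transforming the real-space nonlinearity - not shown)
     * lin : precomputed linear operator L(k) >= 0 (e.g. k^2 + k^4
     *       for a Cahn-Hilliard-like equation - just an example)
     */
    void semi_implicit_step(fftw_complex *uhat, const fftw_complex *nhat,
                            const double *lin, int ntot, double dt)
    {
        int k;
        for (k = 0; k < ntot; k++) {
            double denom = 1.0 + dt * lin[k];
            uhat[k].re = (uhat[k].re + dt * nhat[k].re) / denom;
            uhat[k].im = (uhat[k].im + dt * nhat[k].im) / denom;
        }
    }

Each step then costs a handful of FFTs, so on a Beowulf the per-step cost
is dominated by exactly the transpose/all-to-all discussed above, whereas
an explicit finite-difference step needs only nearest-neighbour
communication. That is why the textbook ranking of the methods can flip.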

Cheers,
Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================



