Beowulf & FFT
tony at MPI-Softtech.Com
Tue Jul 18 17:02:14 PDT 2000
1) FFTW is fast, but not the fastest FFT for all platforms, it is
quite a nice piece of work however.
to a platform (such as CISC Intel architecture), by running a sequence
of different sizes, to build a strategy for the sequential FFTs
you use. This makes it competitive to be sure, but not optimal.
Like ATLAS, competitive is hard to beat unless you write assembler
and use the same tricks it uses plus more, or else specialize the
code to RISC or CISC, etc, the cache hierarchy, prefetching, and
permutation options, etc, etc.
2) The MPI_Alltoall call is far from optimal. For a 2D matrix transpose
it is for sure easy to write a call that is more optimal.
Here are some reasons why:
a) Each MPI alltoall has to work with derived datatypes to capture the
semantics of a matrix transpose; these are quite slow with some
implementations of MPI.
b) The usual implementation of all-to-all, based on the semantics of
the call, is N-1 sends and N-1 receives at each node. When you
know you are actually doing a 2D transpose, you can do better, such
as gather scatter algorithms, or sequences thereof. [Greg Lindahl
spoke of such tradeoffs at one meeting I was at.]
c) Persistent point-to-point communication (if implemented on a fast
implementation of MPI that makes them faster than regular point to
point), provide a good basis for writing a corner-turn (2D tranpose)
of your own, which you might vary in algorithm as a function of the
size of the processor set and matrix size.
Doing the corner turn yourself with your own calls will never be worse
than alltoall, given the state of the algorithms used (which in turn
derives from the difficulty of optimizing alltoall in general).
Surprisingly, people use alltoall as a way to measure the performance
of corner turns on some MPI implementations, even though you can write
a better, faster, still portable corner turn of your own, strictly
in terms of pt2pt calls.
Anthony Skjellum, PhD, President (tony at mpi-softtech.com)
MPI Software Technology, Inc., Ste. 33, 101 S. Lafayette, Starkville, MS 39759
+1-(662)320-4300 x15; FAX: +1-(662)320-4301; http://www.mpi-softtech.com
"Best-of-breed Software for Beowulf and Easy-to-Own Commercial Clusters."
On Tue, 18 Jul 2000, Martin Siegert wrote:
> On Tue, 18 Jul 2000, Walter B. Ligon III wrote:
> > > Since I'm in the process of expanding the beowulf, I'm wondering whether
> > > switching to 133 MHz would improve the results (given that I can find
> > > a motherboard that supports ECC - we had that discussion).
> > Generally, if the problem is that node-to-node communication is too slow,
> > upping your system bus speed isn't going to do much in the way of improving
> > your parallel system performance.
> True. However, I'm not completely convinced that node-to-node communication
> is the only limiting factor: The efficiency for np=2 is miserable as well.
> And that should be shared-memory communication (at least in mpich-1.2.0).
> > The first thing that strikes me is that a 400x400 operation is pretty small
> > for a Beowulf. Its going to be hard to get good efficiency on a small
> > problem like that.
> This is correct. The efficiency gets better, when I go to 1600x1600, but it's
> still misreable: speedup=1.25 for np=2. Furthermore, in my problem time scales
> with length to third or fourth power. Thus, I can't take advantage of the
> larger system sizes, if don't go to much, much later times.
> > I am not familliar with the FFT library you have, but it may be that it was
> > not designed for a machine like a Beowulf, but rather for a finer grain
> > machine like some of the MPPs. Thus the best approach may be a different
> > bit of software, or reworking that software. All-to-All is probably NOT
> > the best implementation on an MPI running over TCP/IP.
> The authors of FFTW claim that this is the fastest FFT you can get. I can only
> say that from all I know this statement is correct, i.e., all other FFTs
> that I tried were slower. Actually I wasted a month by writing my own FFT
> only to find out that is was slower as well :-(
> Furthermore, I can't really see how the necessary matrix transpose can be
> done more efficiently without the Alltoall.
> > Of course, you could always upgrade to a faster network if you have the $$$.
> Yup. That would be nice.
> Martin Siegert
> Academic Computing Services phone: (604) 291-4691
> Simon Fraser University fax: (604) 291-4242
> Burnaby, British Columbia email: siegert at sfu.ca
> Canada V5A 1S6
> Beowulf mailing list
> Beowulf at beowulf.org
More information about the Beowulf