Beowulf & FFT
Robert G. Brown
rgb at phy.duke.edu
Sun Jul 23 21:17:36 PDT 2000
On Tue, 18 Jul 2000, Martin Siegert wrote:
> > I am not familiar with the FFT library you have, but it may be that it was
> > not designed for a machine like a Beowulf, but rather for a finer grain
> > machine like some of the MPPs. Thus the best approach may be a different
> > bit of software, or reworking that software. All-to-All is probably NOT
> > the best implementation on an MPI running over TCP/IP.
> The authors of FFTW claim that this is the fastest FFT you can get. I can only
> say that from all I know this statement is correct, i.e., all other FFTs
> that I tried were slower. Actually I wasted a month by writing my own FFT
> only to find out that it was slower as well :-(
> Furthermore, I can't really see how the necessary matrix transpose can be
> done more efficiently without the Alltoall.
Sorry about the very late response, but I've been out of town.
a) Have you looked into the various books on parallel algorithms that
are on the market (and in a few cases on the web)? I'd wager that this
problem has been studied, as FFTs seem like they'd be fairly important.
These books also go over things like parallel matrix operations where
the "best" algorithm can suddenly change to something quite nonintuitive
depending on matrix scale and the various speeds. Even optimizing only
for the relative speed differential of L1/L2/main memory, ATLAS has to
work quite hard and changes algorithms and blocksizes altogether several
times as the matrix size increases.
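To illustrate the kind of cache blocking ATLAS automates, here is a toy blocked matrix multiply (pure Python, so there is no real cache payoff here; the point is the loop restructuring, and the block size `bs` stands in for the parameter ATLAS tunes empirically per machine):

```python
def matmul_naive(A, B):
    """Textbook triple loop, row-major; strides badly through B for large n."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_blocked(A, B, bs=2):
    """Same product computed block by block, so each bs x bs tile of A and B
    is reused many times while it is (on real hardware) cache-resident."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += aik * B[k][j]
    return C
```

Both routines compute the same product; a library like ATLAS additionally switches among several such loop orderings and tile sizes as the matrix dimensions cross the L1 and L2 capacity boundaries.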
b) Is there a way of reformulating the problem so that it is
embarrassingly parallel? That is, since you are scale limited anyway to
smallish "runs" that might well fit on a single CPU UP system, is there
any benefit to just running a bunch of these calculations independently
in parallel? I have similar tradeoffs in my Monte Carlo code -- I could
probably do really large lattices by splitting up the lattices between
systems and paying a surface-to-volume network IPC penalty, but critical
slowing down and the brutal quadratic scaling of statistics (you need
four times as many samples to halve the error) kill me long before that
becomes a viable alternative. Instead I run lattices that fit easily
onto single CPUs (which are also small enough that I have a chance of
accumulating enough independent samples to get high precision results in
my lifetime) and concentrate on accumulating decent statistics at these
scales with independent parallel runs.
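That quadratic cost of statistics is easy to check numerically: the standard error of a Monte Carlo mean falls only as the square root of the sample count, so halving the error takes four times the samples. A quick sketch with a hypothetical toy estimator (the mean of uniform random draws, seeded for reproducibility):

```python
import math
import random

def mc_standard_error(n, seed=1):
    """Estimate the mean of n uniform(0,1) samples and return the standard
    error of that estimate, s / sqrt(n)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
    return math.sqrt(var / n)

# Quadrupling the sample count should roughly halve the standard error,
# since the per-sample spread s is the same and only sqrt(n) grows.
se_n = mc_standard_error(100_000)
se_4n = mc_standard_error(400_000)
```

The ratio `se_n / se_4n` comes out close to 2, which is the whole case for farming out independent runs: N independent single-CPU runs deliver the same sqrt(N) error reduction as one N-times-longer run, with no IPC penalty at all.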