Beowulf & FFT
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Martin Siegert siegert at sfu.caMon Jul 24 13:01:24 PDT 2000
- Previous message: Beowulf & FFT
- Next message: Beowulf & FFT
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Mon, 24 Jul 2000, Robert G. Brown wrote: > Sorry about the very late response, but I've been out of town. > > a) Have you looked into the various books on parallel algorithms that > are on the market (and in a few cases on the web)? I'd wager that this > problem has been studied as FFT's seem like they'd be fairly important. > These books also go over things like parallel matrix operations where > the "best" algorithm can suddenly change to something quite nonintuitive > depending on matrix scale and the various speeds. Even optimizing only > for the relative speed differential of L1/L2/Main memory, ATLAS has to > work quite hard and changes algorithms and blocksizes altogether several > times. I've looked into everything I could get my hands on (which may not be enough). Anyway, most of the literature on parallel FFTs (yes, at least in the past FFTs were one of the most studied problems for parallelism) is on parallelizing the 1d FFT. That is totally uniteresting from a beowulf perspective: the theoretical speedup (if I remember correctly) is only proportional to log(np) [np: # of processors], and that is almost certainly eaten up by the communication overhead on a beowulf. But two and higher dimensional FFTs are quite promissing (again theoretically): 2d FFTs can be split into two loops of 1d FFTs (one over rows, one over columns). Both loops can be run in parallel without any communication whatsoever [just choose the fastest 1d FFT routine you can find; I believe it is very hard to beat FFTW here]. The problem is that you have to do a matrix transpose between the two loops. FFTW does this using MPI_Alltoall. As Tony pointed out this may not be the best choice. So I've been busy writing a matrix transpose routine myself - something that is trivial for serial programs is totally nontrivial when the matrix is distributed over several processors. If somebody can give me a hint or reference how this is done for best performance, I'll be thankful forever :-) [right now I'm trying MPI_Type_vector to define the datatype to be sent from one processor to the others and then MPI_Isend and MPI_Irecv]. > b) Is there a way of reformulating the problem so that it is > embarrassingly parallel? That is, since you are scale limited anyway to > smallish "runs" that might well fit on a single CPU UP system, is there > any benefit to just running a bunch of these calculations independently > in parallel? I have similar tradeoffs in my Monte Carlo code -- I could > probably do really large lattices by splitting up the lattices between > systems and paying a surface-to-volume network IPC penalty, but critical > slowing down and the brutal quadratic scaling of statistics (need four > times as many samples to halve the error) kills me long before that > becomes a viable alternative. Instead I run lattices that fit easily > onto single CPU's (which are also small enough that I have a chance of > accumulating enough independent samples to get high precision results in > my lifetime) and concentrate on accumulating decent statistics at these > scales with independent parallel runs. Actually, I can: I have to average over random initial conditions. At least in principle - I find that usually 4 samples are already enough, even a single sample already gives you the growth exponent quite accurately. That's why I hoped that I could run a single run in parallel - then I could go to larger system sizes and later times. If it can't be done, I'll do exactly as you said. > >From your description it isn't clear that you can do this, but in "many" > of the problems that will run a really long time one finds that there > are obvious problem segmentations (exploring a parameter space in > parallel, accumulating statistics in stochastic simulations in parallel, > parallel processing independent images or datasets or initial > conditions) that render them EP. One hardly ever writes a program to > just run one (very long) time with just one set of input values. EP > programs generally scale (in the parallel speedup sense) nearly > perfectly on any hardware they'll fit on in the first place... That's absolutely correct in my case as well (if I just average over initial conditions). Unfortunately, the PDE doesn't have any parameters anymore - I can scale everything away. This is different from the Monte-Carlo simulations that I run to compare the PDE results with: Those are indeed embarrassingly parallel. Thanks to everybody for the comments. Martin ======================================================================== Martin Siegert Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 ========================================================================
- Previous message: Beowulf & FFT
- Next message: Beowulf & FFT
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
