[Beowulf] Multidimensional FFTs

Stuart Midgley sdm900 at gmail.com
Tue Feb 28 17:15:08 PST 2006

Hi Bill

I've tested FFTs rather extensively and run other codes that require  
a transpose.  In my experience, a well tuned gig-e network is capable  
of giving speedup, though not necessarily scaling that well.  The  
most important thing is that you have full bisection bandwidth.   
Anything less will reduce your scaling.  That is, if you use gig-e  
you can't trunk switches; you will need to stay within a single  
switch.  Typically, I've seen a 16 cpu job on gig-e give about a 10  
times speedup.  Of course, it is processor/memory/nic dependent.
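The transpose Stuart mentions is the heart of the standard distributed 3D FFT: each node FFTs the axes it holds locally, then a global transpose (an all-to-all) moves the remaining axis onto each node so it can be FFT'd too.  A serial numpy sketch of that slab decomposition (my illustration, not code from the post; array sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
a = rng.standard_normal((n, n, n)) + 1j * rng.standard_normal((n, n, n))

# Step 1: each "process" owns a slab of x-planes and FFTs its local y and z axes.
b = np.fft.fftn(a, axes=(1, 2))

# Step 2: the global transpose (MPI_Alltoall on a real cluster) makes the
# x axis local.  This is the step that hammers bisection bandwidth.
c = b.transpose(1, 2, 0).copy()

# Step 3: FFT the now-local axis.
d = np.fft.fft(c, axis=2)

# Transpose back to the original layout; the result matches a direct 3D FFT.
result = d.transpose(2, 0, 1)
assert np.allclose(result, np.fft.fftn(a))
```

Because every node exchanges data with every other node in step 2, any shortfall in bisection bandwidth shows up directly in the transpose time.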

I've also run FFTs on Quadrics Elan 3/4, IBM HPS, and SGI NUMAlink  
4.  Since these are considerably higher bandwidth networks, they  
perform much better.  On a 16 cpu job I've seen around 14 times  
speedup on these higher bandwidth networks.

As the size increases (say 256 cpu's) the networks that maintain full  
bisection bandwidth scale the best.  There are very few reasonably  
priced gig-e switches that maintain full bisection bandwidth at 256  
cpu's, while Quadrics and HPS do (though their starting price is  
high, at the larger system sizes they become a realistic  
proposition).  NUMAlink falls away a little due to the weird network  
topology (dual plane quad bristle fat tree), which has drops in  
network connectivity per cpu as the system gets larger.
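To see why the transpose stresses the network this way: with the grid split into P slabs, each process holds N^3/P elements and must hand N^3/P^2 of them to every other process.  A rough estimate (my arithmetic, assuming complex doubles at 16 bytes per element; the post gives no specific numbers):

```python
# Rough message-volume estimate for the all-to-all transpose in a
# slab-decomposed 3D FFT.  Assumptions (mine): N^3 complex-double grid,
# P processes, one transpose per 3D FFT.

def transpose_traffic_bytes(n, p, bytes_per_elem=16):
    """Return (bytes each pair exchanges, total bytes on the wire)."""
    per_pair = (n ** 3 // (p * p)) * bytes_per_elem
    total = per_pair * p * (p - 1)
    return per_pair, total

per_pair, total = transpose_traffic_bytes(256, 16)
print(per_pair, total)  # 1 MiB per pair, 240 MiB total for one transpose
```

The total wire traffic stays near N^3 bytes regardless of P, so as P grows the job lives or dies on how much of that all-to-all the network can carry simultaneously, which is exactly what full bisection bandwidth buys you.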

If you want to go with gig-e a few things to be aware of:

* The nic matters (pro1000MT's give 10-15% better performance)

* Go with single cpu nodes - higher per cpu network bandwidth

* If you get dual core cpu's, treat it as a single core node (allow  
the 2nd core to do all the tcp stuff)

I've played around with multiply connected nodes (nodes that have  
dual ported nics) and the 2nd nic doesn't give you much (10-15%) and  
requires a fair bit of stuffing around to get it working well.  I  
think you would be better off running your global fs and other  
services over 1 nic and your mpi traffic over the other.  At least  
this way, your fs and services shouldn't be stealing your bandwidth.

You may even try running mpi-gamma on the 2nd nic, which should give  
you better bandwidth, hence better scaling (I haven't tried this).

If you want real measured numbers, drop me a personal email.


On 01/03/2006, at 2:26, Bill Rankin wrote:

> Hey gang,
> I know that in the past, multidimensional FFTs (in my case, 3D)  
> have posed a big challenge in getting them running well on  
> clusters, mainly in the areas of scalability.  This is somewhat due  
> to the need for an All2All communication step in the processing  
> (although there seem to be some alternative approaches here).
> There is a research group here at Duke doing some application  
> development and they are looking at implementing their codes in a  
> cluster environment.  The main problem is that 95% of their  
> processing time is taken up by medium to large sized 3D FFTs  
> (minimum 64 elements on an edge, 256k total elements).
> So I was wondering what the current "state of the art" is in  
> clustered 3D FFTs?  I've googled around a bit, but most of the  
> results seem a little dated.  If someone could point me to any  
> recent papers or studies, I would be grateful.
> Some specifics that I am interested in would be a good comparison  
> of different interconnects on overall performance, as this will  
> have a significant impact on the design of their cluster.
> Thanks,
> -bill

Dr Stuart Midgley
sdm900 at gmail.com
