[Beowulf] choosing a high-speed interconnect

Tue Oct 12 22:12:31 PDT 2004

Hi Matt:

  Good to see you here ... :)

Matt L. Leininger wrote:

>  There are multiple 128 node (and greater) IB systems that are stable
>and are being used for production apps.  The #7 top500 machine from
>RIKEN is using IB and has been in production for over six months.  My
>cluster at Sandia (about 128 nodes) is being used for IB R&D and

FWIW I used the nice setup that the AMD Dev center team have set up for 
benchmarking and testing.  They have a nice IB platform there.

[...]

>   QP scaling isn't as critical an issue if the MPI implementation sets
>up the connections as needed (kind of a lazy connection setup).  Why
>set up an all-to-all QP connectivity if the MPI implements an all-to-all
>or collectives as tree based pt2pt algorithms?  Network congestion on
>larger clusters can be reduced by using source based adaptive
>(multipath) routing instead of the standard IB static routing.  
>  
>
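
As a rough illustration of the lazy setup Matt describes, here is a small
Python sketch.  It is not how any particular MPI does it: the
LazyConnections class, the tuple standing in for a queue pair, and the
choice of a binomial tree rooted at rank 0 are all assumptions made for
the example.  It only counts how many connections a rank ends up opening.

```python
class LazyConnections:
    """Hypothetical per-rank connection table; a tuple stands in for a QP."""

    def __init__(self, rank, nranks):
        self.rank = rank
        self.nranks = nranks
        self.peers = {}  # peer rank -> connection handle, filled on demand

    def _connect(self, peer):
        # Placeholder for the real (expensive) queue pair setup.
        return ("qp", self.rank, peer)

    def send(self, peer, payload):
        # Open the connection only the first time this peer is used.
        if peer not in self.peers:
            self.peers[peer] = self._connect(peer)
        return self.peers[peer], payload


def tree_peers(rank, nranks):
    """Peers of `rank` in a binomial tree rooted at rank 0 (assumed layout)."""
    peers = set()
    if rank > 0:
        peers.add(rank & (rank - 1))  # parent: clear the lowest set bit
    for r in range(1, nranks):
        if (r & (r - 1)) == rank:     # r is a child of rank
            peers.add(r)
    return peers


nranks = 128
conn = LazyConnections(0, nranks)
for peer in tree_peers(0, nranks):
    conn.send(peer, b"data")

lazy_count = len(conn.peers)   # connections rank 0 actually opened
eager_count = nranks - 1       # connections an all-to-all setup would open
```

With this assumed tree layout on 128 ranks, no rank ever needs more than a
handful of connections, against 127 per rank for an eager all-to-all setup.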

On feature utility ... (QP scaling, ...)  (more to Mark than Matt here)

One of the things I remember as a "feature" much touted by the 
marketeers in the ccNUMA 6.5 IRIX days was page migration.  This feature 
was supposed to ameliorate memory access hotspots in parallel codes.  
Enough hits on a page from a remote CPU, and whammo, off it went to the 
remote CPU.
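
The basic idea can be sketched in a few lines of Python.  This is only a
toy model of "enough remote hits and the page moves", not the actual IRIX
policy; the Page class and the threshold value are made up for the
example.

```python
from collections import defaultdict


class Page:
    """Toy model of a memory page with threshold-based migration."""

    def __init__(self, home, threshold=4):
        self.home = home            # node that currently holds the page
        self.threshold = threshold  # arbitrary illustrative value
        self.remote_hits = defaultdict(int)
        self.migrations = 0

    def access(self, node):
        if node == self.home:
            return                  # local access, nothing to count
        self.remote_hits[node] += 1
        if self.remote_hits[node] >= self.threshold:
            # Enough remote hits: move the page and reset the counters.
            self.home = node
            self.remote_hits.clear()
            self.migrations += 1


page = Page(home=0, threshold=4)
for _ in range(4):
    page.access(1)  # the fourth remote hit from node 1 triggers the move
```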

Turns out this was "A Bad Thing(TM)".  There were many reasons for this, 
but in the end, page migration was little more than a marginal feature, 
best used in specific corner cases.  Sure, someone will speak up and 
tell me how much pain it saved them, or made their code 3 orders of 
magnitude faster.  I never saw that in general.  I got better results 
from dplace and large pages than I ever got from some of these other 
features.

The point is that there are often lots of features, some of which might 
even be generally useful.  Others might simply not be useful, as the 
application-level issues might be better served by other methods (as you 
pointed out). 

IB works pretty nicely on clusters.  So do many of the other 
interconnects.  If you have latency bound or bandwidth bound problems, 
certainly it would be worth looking into.

The original question was which to look at.  First the need has to be 
assessed, and from there a reasonable comparison may be made.  IB does 
look like it is drawing wide support right now, and it is not single 
sourced.  It is possible (though I haven't done much in the way of 
measurement) that TCP offload systems might help as well.  If you are 
not extremely sensitive to latency, you might be able to use these.  If 
you are, you should stick to the low latency fabrics.

>  Also remember that IB has a lot more field experience than the latest
>Myricom hardware and MX software stack.  

Joe