[Beowulf] RDMA NICs and future beowulfs

Mon Apr 25 17:26:33 PDT 2005

Vincent Diepeveen wrote:
> At 06:02 PM 4/25/2005 -0400, Mark Hahn wrote:
> 
>>>Would anyone on this list have pointers to
>>>which network cards on market support 
>>>RDMA (Remote Direct Memory Access)?
>>
>>ammasso seems to have real products.  afaikt, you link with their 
>>RDMA-enabled MPI library and get O(15) microsecond latencies.
>>to me, it's hard to see why this would be worth writing home...

FWIW, we installed one of the first Ammasso-based clusters.  We are 
still building software for it, mostly for StarCD.

I won't comment on positioning other than to say that they occupy an 
interesting price/performance niche.  Infiniband is dropping rapidly in 
price, and is getting more attractive over time.  As recently as 6 months 
ago, it added about $2k/node ($1kUS/HCA + $1kUS/port) to the cost of a 
cluster.  More recently the cost per port appears to be quickly moving 
towards $400 US, and the HCAs are dropping so that you can add only 
$1kUS to the price per node for your cluster.

If you select the right switch, which you need anyway for your command 
and control net, you can get good port-to-port latencies.

I think the next reasonable question to answer is fundamentally what 
the cost-benefit analysis looks like.  If you are performance bound and 
have an infinite budget, you need to look at the highest performance 
fabrics.  Currently the Ammasso is not that.  If you need to optimize 
performance versus cost constraints, and your code gets some boost from 
the lower latencies vs. Ethernet, the question is whether or not the 
value of that performance is enough to justify the added cost of the 
cards.

>>>Would anyone have hands on experience 
>>>with performance, usability, and cost aspects
>>>of this new RDMA technology?
>>
>>they work, 

Agreed.  I would like to see a LAM implementation in addition to the 
MPICH one.  The installation is actually not that hard, and I have a 
simple Perl script to auto-generate an rnic_cfg file from your host IP 
according to some simple rules.

>>but it's very unclear where their natural niche is.

Not sure I agree with this.  If there were no value in TCP offload, then 
why would Intel announce (recently) that they want to include this 
technology in their future chipsets?  Basically the argument that I make 
here is that I think there is a natural place for them, but it is on the 
motherboards.  It is much the same way you have a Graphics Processing 
Offload Engine in desktop systems, though in that case the motherboard 
makers found value in supplying the high-performance interface rather 
than the offload engines themselves.

I personally think that the offload engine concept is a very good one. 
I like this model.  With PCI-e (and possibly HTX), I think it has some 
very interesting possibilities.

>>if you want high bandwidth, you don't want gigabit.

Agreed.  Out of sheer curiosity, what codes are more bandwidth bound 
than latency bound over the high performance fabrics?  Most of the codes 
we play with are latency bound.

I wrote a message passing example for my class with a humorous name to 
illustrate passing vectors (or matrices).  I could easily turn this into 
a bandwidth test by using huge vectors.  But I am not sure that most 
folks are doing that in their codes.
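
One rough way to see where the split between latency bound and bandwidth 
bound falls is a linear cost model, t = latency + size / bandwidth.  The 
sketch below is only that, a sketch: the 15 microsecond latency is the 
order-of-magnitude figure quoted earlier in this thread, the bandwidth is 
raw gigabit wire speed, and the function names are made up for 
illustration.  Real fabrics and MPI stacks will deviate from it.

```python
# Linear cost model for one message: t = latency + size / bandwidth.
# All numbers here are illustrative assumptions, not measurements.

LATENCY_S = 15e-6        # ~15 microseconds, the order quoted in the thread
BANDWIDTH_BPS = 125e6    # gigabit wire speed: 1e9 bits/s / 8 = 125e6 bytes/s


def transfer_time(size_bytes, latency_s=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Modelled time in seconds to move one message of size_bytes."""
    return latency_s + size_bytes / bandwidth


def crossover_size(latency_s=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Message size in bytes where latency and wire time are equal."""
    return latency_s * bandwidth


def latency_fraction(size_bytes, latency_s=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Fraction of the modelled transfer time spent on latency."""
    return latency_s / transfer_time(size_bytes, latency_s, bandwidth)


if __name__ == "__main__":
    print("crossover size (bytes):", round(crossover_size()))
    for n_doubles in (1, 100, 10_000, 1_000_000):
        size = 8 * n_doubles  # 8 bytes per double
        print(n_doubles, "doubles:",
              "%.1f%% of time is latency" % (100 * latency_fraction(size)))
```

Under these assumptions, messages well below a couple of kilobytes are 
dominated by latency, and only vectors of many thousands of doubles turn 
the exchange into a bandwidth test.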

>>if you want low latency, you don't want gigabit,

Agreed ...

>> even RDMA-gigabit.

I think a cost-benefit analysis (CBA) is worth doing (and we may do 
this).  If it gives a 5% boost for a 2% increase in cluster cost, is 
that worth it?  If it gives a 30% boost for a 2% increase in cost, is 
that worth it?  What fundamentally is the right cutoff?  (Rhetorical 
question; the cutoff varies based upon needs, funds, application, ...)
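
As a minimal sketch of the arithmetic, one possible metric is 
performance per dollar relative to the baseline cluster.  This is an 
assumption for illustration only (the function names are hypothetical, 
and performance per dollar is not the only sensible metric); the two 
scenarios are the 5%/2% and 30%/2% cases above.

```python
# Relative performance-per-dollar for an interconnect upgrade.
# The metric itself is an assumption; other sites may weigh
# time-to-solution, power, or licence costs differently.


def perf_per_cost_ratio(speedup_frac, cost_increase_frac):
    """Performance per dollar relative to baseline (>1 means better value)."""
    return (1.0 + speedup_frac) / (1.0 + cost_increase_frac)


def break_even_speedup(cost_increase_frac):
    """Smallest speedup fraction that keeps performance per dollar flat.

    Solving (1 + s) / (1 + c) = 1 for s gives s = c.
    """
    return cost_increase_frac


if __name__ == "__main__":
    for speedup, cost in ((0.05, 0.02), (0.30, 0.02)):
        ratio = perf_per_cost_ratio(speedup, cost)
        print("%.0f%% boost for %.0f%% cost: perf/$ x %.3f"
              % (100 * speedup, 100 * cost, ratio))
```

By this metric any boost larger than the fractional cost increase is a 
net gain, so both cases come out ahead; whether the gain is large enough 
to matter is still a judgment call for each site.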

[...]

> In reality the bandwidth/latency hunger gets even bigger in future when the
> cell type processors arrive. Correct me if i'm wrong, it really needs a
> branch prediction table for my branch intensive integer code, but even then
> such a processor is kicking butt. I mean 8 processing help units (SPE's) at
> 1 cpu and a main power pc processor. 
> 
> For floating point that's like 250 Gflop or so practical to their avail.

The cell will not automagically give you 250 Gflop.  It will not be easy 
to program.

> 
> That *really* will make the networks the weakest chain.

I haven't looked at the design in detail, but it looks like you are 
going to need a multistage resource scheduler to handle streaming data 
into the cell.  Think of it as a super-multi-core NPU that has a more 
general instruction set.

The Itanium has been out for a while and compilers for it are still 
maturing.  VLIW^H^H^H^HEPIC is hard.  Anyone remember the Multiflow 
Trace?  I would not expect to see a gcc for the cell (and now watch IBM 
make me eat my words).  I would expect that programming it is going to 
be a challenge.

[...]

> So obviously cell processor is kind of a step back for such software, but
> even then we can see a single cell 4.0Ghz probably like a 8 processor
> 2.8Ghz Xeon MP machine. 

Again, in all likelihood this is going to be difficult to program for 
(and if there are IBMers out there who have one and know I am wrong, 
please let me know, or even better, let me at it :) ).  Good compilers 
are hard.  Very good compilers are rare.  Suboptimal compilers are the 
norm.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615