[Beowulf] RDMA NICs and future beowulfs

Mon Apr 25 22:42:18 PDT 2005

> I won't comment on positioning other than to say that they occupy an 
> interesting price performance niche.

"interesting" in the fortune-cookie sense?  I think so.  
gigabit is so cheap that it's barely worth building cards for it;
switches are also extremely commoditized.  OTOH it's not impossible to 
"add value" to commodity parts (Cisco adds PHB-comfort, for instance.)

> Infiniband is dropping rapidly in 
> price, and is getting more attractive over time.  As recently as 6 months 
> ago, it added about $2k/node ($1kUS/HCA + $1kUS/port) to the cost of a 
> cluster.  More recently the cost per port appears to be quickly moving 
> towards 400 $US, and the HCA's are dropping so that you can add only 
> $1kUS to the price per node for your cluster.

I claim that IB is so drastically faster than GBE-based RDMA
that this is not a useful comparison.

> > but it's very unclear where their natural niche is.
> 
> Not sure I agree with this.  If there were no value in TCP offload, then 
> why would Intel announce (recently) that they want to include this 
> technology in their future chipsets?

mainly the PHB factor, but also Intel is looking to distinguish itself,
and also to use up all those extra transistors.  but ask the hardcore 
linux network-stack folk what they think of TOE, I dare you ;)

Intel used to go on and on about IB in the chipset too, no?

> Basically the argument that I make 
> here is that I think there is a natural place for them, but it is on the 
> motherboards.  Much in the same way you have a Graphics Processing 
> Offload Engine in desktop systems, though in the case of motherboards, 
> they found value in supplying the high performance interface rather than 
> the offload engines.

it's a good argument.  but it's based on a sort of chip-level competitive 
advantage.  is there enough work in doing TCP that it's worth bothering 
with a TOE?  bear in mind that a TOE is inherently a coprocessor running 
its own firmware, probably a small RTOS, etc.  what do you do when there's 
a bug in your TOE?  how about a DoS in your TOE's firmware?

note also that GPU's have a huge amount of leverage to apply: current
high-end ones are >30 GB/s onboard, with many, many parallel ALU's.
since graphics is inherently very data-parallel, this approach works,
and is not crippling to design.  I just don't see where RDMA can get 
enough leverage to make a difference.

> I personally think that the offload engine concept is a very good one. 
> I like this model.  With PCI-e (and possibly HTX), I think it has some 
> very interesting possibilities.

I think there's a place for a very simple state-machine kind of smart nic.
the idea would be to set up a table of actions for the nic to take, based
on various fields in the incoming packet.  for instance, tell it that when
it receives a syn for a particular host/port/seq, it can send the packet 
at a particular memory location.  or that if it receives a particular
host/port/seq, it can send a particular packet (which might be an ack).
you can easily imagine that this could handle the socket handshake,
even RDMA-type things.  but it's certainly not TOE in the normal sense,
and it wouldn't require a coprocessor+firmware+rtos.
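
A minimal sketch of the table-driven idea above, as a software simulation
only.  Every name here (TableNic, Action, install, receive) is hypothetical
and does not correspond to any real NIC or driver API.  The "store" action
is an added assumption about what an RDMA-type entry might look like, and
making entries one-shot is likewise an illustrative choice.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical match key: (remote host, port, sequence number).
Key = Tuple[str, int, int]


@dataclass(frozen=True)
class Packet:
    host: str
    port: int
    seq: int
    payload: bytes = b""


@dataclass(frozen=True)
class Action:
    kind: str     # "send": transmit the prebuilt packet found at `address`
                  # "store": place the payload at `address` (assumed RDMA-like case)
    address: int  # location in the simulated host memory


@dataclass
class TableNic:
    table: Dict[Key, Action] = field(default_factory=dict)
    memory: Dict[int, bytes] = field(default_factory=dict)
    sent: List[bytes] = field(default_factory=list)

    def install(self, host: str, port: int, seq: int, action: Action) -> None:
        """Host side: register what to do when a matching packet arrives."""
        self.table[(host, port, seq)] = action

    def receive(self, pkt: Packet) -> str:
        """NIC side: one table lookup per packet, no protocol stack."""
        # Entries are one-shot here because they are keyed on an exact seq.
        action = self.table.pop((pkt.host, pkt.port, pkt.seq), None)
        if action is None:
            return "to_host"  # no match: hand the packet to the normal stack
        if action.kind == "send":
            self.sent.append(self.memory[action.address])
            return "sent"
        if action.kind == "store":
            self.memory[action.address] = pkt.payload
            return "stored"
        raise ValueError(f"unknown action kind: {action.kind}")


nic = TableNic()
nic.memory[0x1000] = b"prebuilt reply"  # e.g. an ack prepared by the host
nic.install("10.0.0.2", 5000, 1, Action("send", 0x1000))
nic.install("10.0.0.2", 5000, 2, Action("store", 0x2000))
print(nic.receive(Packet("10.0.0.2", 5000, 1)))
print(nic.receive(Packet("10.0.0.2", 5000, 2, b"bulk data")))
print(nic.receive(Packet("10.0.0.2", 5000, 99)))
```

The point of the sketch is that the per-packet work is a lookup plus a
copy, which is why no coprocessor or firmware stack is implied; anything
that misses the table simply falls through to the host.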

> >>if you want high bandwidth, you don't want gigabit.
> 
> Agreed.  Out of sheer curiosity, what codes are more bandwidth bound 
> than latency bound over the high performance fabrics?  Most of the codes 
> we play with are latency bound.

I have users who do cosmology who seem to like shipping around pretty big
chunks of data (cached data from neighbors on the grid).  I have another 
user who is quite memory-intensive: ships around large chunks of his 
very, very large matrices.

> I think a CBA is worth doing (and we may do this).  If it gives a 5% 
> boost for a 2% increase in cluster cost, is that worth it?  If it gives 

does it actually give a 5% boost?  also, are your nodes quite cheap?

> a 30% boost for a 2% increase in cost, is that worth it?  What 

I have the impression Ammasso cards are in the $200 range, but this is 
purely an impression.  as such, they'd represent 10% or so of a fairly
"normal" low-end cluster node.  the main problem is that low-end cluster
nodes tend to run serial jobs which have essentially no dependence on 
the network.  most other users (here at least) are significantly more net-intensive,
and want 2 us latency and 1 GB/s.  I guess I'm suggesting the RDMA niche
depends on an odd category of parallel-but-not-very applications.
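
The cost/benefit question in this exchange reduces to simple arithmetic,
sketched below.  The function names are made up for illustration, and the
$2000 node price is only an assumption inferred from "$200 range" being
"10% or so" of a node; none of these figures are measurements.

```python
def perf_per_cost(boost: float, cost_increase: float) -> float:
    """Performance per dollar relative to the baseline cluster.

    Above 1.0 the upgrade pays for itself; below 1.0 the same money
    would have bought more throughput as additional plain nodes.
    """
    return (1.0 + boost) / (1.0 + cost_increase)


def nic_cost_fraction(nic_price: float, node_price: float) -> float:
    """Fraction of the node price that the NIC adds."""
    return nic_price / node_price


# Assumed prices: $200 card on a $2000 low-end node, i.e. about 10%.
assumed_increase = nic_cost_fraction(200.0, 2000.0)

scenarios = [
    ("5% boost, 2% cost", 0.05, 0.02),
    ("30% boost, 2% cost", 0.30, 0.02),
    ("5% boost, assumed 10% cost", 0.05, assumed_increase),
]
for label, boost, cost in scenarios:
    print(f"{label}: {perf_per_cost(boost, cost):.3f}")
```

Break-even is where the boost equals the cost increase, so at roughly 10%
of node cost a card needs to deliver more than a 10% speedup on the jobs
that actually run, which serial jobs will not show.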

I find that once people get that first hit of parallel, they usually
want more, and that rapidly turns them into IB/Myri/Quadrics junkies...

regards, mark hahn.