[Beowulf] RDMA NICs and future beowulfs

Vincent Diepeveen diep at xs4all.nl
Mon Apr 25 16:26:46 PDT 2005

At 06:02 PM 4/25/2005 -0400, Mark Hahn wrote:
>> Would anyone on this list have pointers to
>> which network cards on market support 
>> RDMA (Remote Direct Memory Access)?
>ammasso seems to have real products.  afaikt, you link with their 
>RDMA-enabled MPI library and get O(15) microsecond latencies.
>to me, it's hard to see why this would be worth writing home...
>> Would anyone have hands on experience 
>> with performance, usability, and cost aspects
>> of this new RDMA technology?
>they work, but it's very unclear where their natural niche is.
>if you want high bandwidth, you don't want gigabit.
>if you want low latency, you don't want gigabit, even RDMA-gigabit.
>the truth is that reducing networking overhead is always a bit of a 
>hard sell.  consider the appeal of avoiding context switches - 
>it sounds great, right?  how much of that appeal is based on the 
>(mistaken) impression that context switches are expensive?  
>lower CPU overhead also sounds great, but it made a lot more sense
>5-10 years ago when systems had .3 GB/s memory bandwidth (<5% of today).

Not really,

I'd prefer to ship 100 MB/s to the other side of the copper wire, and
receive 100 MB/s back.

In practice I ship 40 MB/s or so each way, so 80 MB/s in total. On gigabit
that costs a full CPU, simply because it eats all of the CPU's bandwidth.
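Here is the back-of-the-envelope version of that claim. All the numbers are illustrative assumptions (the copies-per-byte count and the 0.3 GB/s memory bus Mark mentioned above), not measurements:

```python
# Why full-duplex gigabit traffic can eat a whole CPU when the NIC does
# no offload: every byte sent or received gets copied through RAM several
# times (user->kernel copy, checksum pass, DMA touch), so the wire rate
# gets multiplied against the memory bus.

tx = 40e6                  # assumed send rate, 40 MB/s
rx = 40e6                  # assumed receive rate, 40 MB/s
copies_per_byte = 3        # assumption: copies/touches per byte in the stack
mem_bw = 0.3e9             # ~0.3 GB/s memory bandwidth of an older system

traffic = tx + rx                          # 80 MB/s on the wire
memory_traffic = traffic * copies_per_byte # bytes actually moved through RAM
cpu_fraction = memory_traffic / mem_bw     # share of the memory bus consumed

print(f"wire traffic:   {traffic / 1e6:.0f} MB/s")
print(f"memory traffic: {memory_traffic / 1e6:.0f} MB/s")
print(f"fraction of memory bus: {cpu_fraction:.0%}")
```

On those assumed numbers the stack burns 80% of the memory bus just moving network data, which is the "loses a full cpu" scenario; on a modern system with 10x the memory bandwidth the same traffic costs 8%, which is Mark's counterpoint.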

With newer machines and newer memory controllers, I'd prefer to ship even
more than that. That needs newer NICs; 10 Gb NICs that use the CPU for the
work will also be too slow by then.

The problem is simply always there, on every system.

So the need for a high-end network is *always* there.

>afaikt, gigabit-RDMA folks are really pinning their hopes on 10 Gb.

CPU and memory bandwidth will be a lot bigger by then too, so the CPU can
consume more data, which gives the same problem all over again...

In reality the bandwidth/latency hunger will get even bigger in the future,
when the Cell-type processors arrive. Correct me if I'm wrong, and my
branch-intensive integer code really needs a branch prediction table, but
even then such a processor is kicking butt. I mean: 8 processing helper
units (SPEs) on one CPU, plus a main PowerPC processor.

For floating point that's something like 250 Gflop/s of practical
throughput at their disposal.
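That "250 Gflop or so" falls out of simple arithmetic, assuming each SPE does 4-wide single-precision SIMD with fused multiply-add (8 flops per cycle per SPE, an assumption on my part):

```python
# Peak single-precision throughput of a Cell-style chip: 8 SPEs, each
# retiring a 4-wide SIMD fused multiply-add (8 flops) every cycle.

spes = 8
clock_ghz = 4.0
flops_per_cycle = 8  # assumed: 4 lanes x (multiply + add)

peak_gflops = spes * clock_ghz * flops_per_cycle
print(f"peak: {peak_gflops:.0f} Gflop/s")  # 256 Gflop/s, i.e. "250 or so"
```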

That *really* will make the network the weakest link.

Now the 'www.diep3d.com' chess program I have is pure integer code, loaded
with 100k+ branches or so, and its artificial-intelligence logic grows
every year.

What mankind, or at least the top programmers, have learned is that the
only way to approach human nature is to create a vast number of logic rules
and put them in a single program. The whole then behaves
'semi-intelligently', once you combine it with a huge search that explores
billions of possibilities using that complex logic.

So obviously the Cell processor is something of a step back for such
software, but even then a single 4.0 GHz Cell will probably perform like an
8-processor 2.8 GHz Xeon MP machine.

That means immense speedups even for software never intended to run on such
processors. Intel and AMD obviously need an answer to the awesome
processing power the Cell approach promises.

Note the ideal would be a processor with, say, 8 cores and 512 KB cache per
processing element (256 KB is a bit tiny). Even a small branch prediction
table of, say, 2048 entries would do miracles. An 18-cycle misprediction
penalty is no problem, as long as there is a branch prediction table of
some size.

I hope those Cell monsters get cheap and available to everyone; that will
simply FORCE the other manufacturers to produce their own 8-core CPUs :)

In either case, the data that one processor generates and wants to
communicate to the world will increase incredibly thanks to the additional
processing power, and the networks will only barely keep up with it.

Whatever good physical reasons there are for that being the case, it makes
the network an ever weaker link.

Just look at processors. In 2003 I ran on 500 MHz MIPS processors
delivering 1 Gflop each (Origin 3800, 512 processors, www.sara.nl). Soon
I'll jump from that 1 Gflop to 0.3 Tflop.

A factor-of-300 increase in processing power within a few years.

In theory a node could carry two 'high-end' Cell processors, each
delivering 0.5 Tflop and eating a bit more power than the 50-80 watts
estimated for the 4 GHz version. Delivering 1 Tflop per node in total, that
small box will generate quite some data.

Correct me if I'm wrong: when talking about matrix multiplications, it will
generate on the order of 4 terabytes of data a second.
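That 4 TB/s figure corresponds to the worst case of one 4-byte single-precision operand fetched per flop, i.e. zero cache reuse (an assumption of mine; a blocked matrix multiply reuses operands from cache and needs far less external traffic, which is the whole point of BLAS-3 blocking):

```python
# Operand traffic of a 1 Tflop/s node doing naive matrix multiplication
# with no cache reuse: every flop fetches one fresh 4-byte float.

flops_per_s = 1e12     # 1 Tflop/s node (two 0.5 Tflop Cells)
bytes_per_flop = 4     # assumed: one single-precision operand per flop

traffic_bytes_per_s = flops_per_s * bytes_per_flop
print(f"worst-case traffic: {traffic_bytes_per_s / 1e12:.0f} TB/s")  # 4 TB/s
```

With cache blocking the per-flop traffic drops roughly in proportion to the block size, but even a small fraction of 4 TB/s leaking onto the interconnect dwarfs any network available today.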

How are beowulfs in future going to stream that away over the network?
