[Beowulf] RDMA NICs and future beowulfs
Joe Landman
landman at scalableinformatics.com
Mon Apr 25 17:26:33 PDT 2005
Vincent Diepeveen wrote:
> At 06:02 PM 4/25/2005 -0400, Mark Hahn wrote:
>
>>>Would anyone on this list have pointers to
>>>which network cards on market support
>>>RDMA (Remote Direct Memory Access)?
>>
>>ammasso seems to have real products. afaict, you link with their
>>RDMA-enabled MPI library and get O(15) microsecond latencies.
>>to me, it's hard to see why this would be worth writing home...
FWIW, we installed one of the first Ammasso-based clusters. We are
still building some things for it, mostly for StarCD.
I won't comment on positioning other than to say that they occupy an
interesting price/performance niche. Infiniband is dropping rapidly in
price, and is getting more attractive over time. As recently as 6
months ago, it added about $2k US/node ($1k US per HCA + $1k US per
switch port) to the cost of a cluster. More recently the cost per port
appears to be moving quickly towards $400 US, and HCA prices are
dropping as well, so that Infiniband now adds only about $1k US per
node to the price of your cluster.
If you select the right switch, which you need anyway for your command
and control net, you can get good port-port latencies.
I think the next reasonable question is fundamentally one of
cost-benefit analysis. If you are performance bound and have an
infinite budget, you need to look at the highest performance fabrics.
Currently the Ammasso is not that. If you need to optimize performance
under cost constraints, and your code gets some boost from the lower
latencies relative to plain ethernet, the question is whether that
performance gain is enough to justify the added cost of the cards.
>>>Would anyone have hands on experience
>>>with performance, usability, and cost aspects
>>>of this new RDMA technology?
>>
>>they work,
Agreed. I would like to see a LAM implementation in addition to the
MPICH one. The installation is actually not that hard, and I have a
simple Perl script to auto-generate an rnic_cfg file from your host IP
according to some simple rules.
> but it's very unclear where their natural niche is.
Not sure I agree with this. If there were no value in TCP offload,
why would Intel announce (recently) that they want to include this
technology in their future chipsets? Basically, the argument I make
here is that there is a natural place for these engines, but it is on
the motherboard. Much in the same way you have a graphics processing
offload engine in desktop systems, though in the case of motherboards
the vendors found value in supplying the high performance interface
rather than the offload engines themselves.
I personally think that the offload engine concept is a very good one.
I like this model. With PCI-e (and possibly HTX), I think it has some
very interesting possibilities.
>>if you want high bandwidth, you don't want gigabit.
Agreed. Out of sheer curiosity, what codes are more bandwidth bound
than latency bound over the high performance fabrics? Most of the
codes we play with are latency bound.
I wrote a message passing example for my class, with a humorous name,
to illustrate passing vectors (or matrices). I could easily turn it
into a bandwidth test by using huge vectors, but I am not sure that
most folks are doing that in their codes.
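For anyone who wants to measure this on their own fabric, here is a
minimal ping-pong sketch along those lines (my own illustration, not
the class example above; message sizes and iteration counts are
arbitrary). Small messages give you the one-way latency, huge ones
give you effective bandwidth:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, n, iters = 1000;
    double t0, dt;
    char *buf;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* message sizes from 8 bytes up to 2 MB, multiplying by 8 each pass */
    for (n = 8; n <= (1 << 21); n *= 8) {
        buf = malloc(n);
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            if (rank == 0) {            /* bounce the buffer to rank 1 */
                MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else if (rank == 1) {     /* ... and back again          */
                MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        dt = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8d bytes  %10.2f usec one-way  %10.2f MB/s\n",
                   n, 1.0e6 * dt / (2.0 * iters),
                   2.0 * iters * n / dt / 1.0e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}

Run it across two nodes (mpirun -np 2 ...) once over the plain gigabit
path and once linked against the RDMA-enabled MPI, and the small-message
numbers tell you what the lower latency is actually buying you.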
>>if you want low latency, you don't want gigabit,
agreed ...
>> even RDMA-gigabit.
I think a CBA is worth doing (and we may do this). If RDMA gives a 5%
boost for a 2% increase in cluster cost, is that worth it? If it gives
a 30% boost for a 2% increase in cost, is that worth it? What,
fundamentally, is the right cutoff? (Rhetorical question; the cutoff
varies based upon needs, funds, application, ...)
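To put rough numbers on it (mine, purely for illustration): if
throughput per dollar is your figure of merit, a 30% boost for a 2%
cost increase works out to 1.30/1.02, or about 1.27 times the work per
dollar, which is an easy yes. A 5% boost for the same 2% is 1.05/1.02,
about 1.03, close enough to the noise that the answer really does come
down to your needs, funds, and application.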
[...]
> In reality the bandwidth/latency hunger gets even bigger in the future
> when the cell type processors arrive. Correct me if I'm wrong, it
> really needs a branch prediction table for my branch-intensive integer
> code, but even then such a processor is kicking butt. I mean 8
> processing helper units (SPEs) plus a main PowerPC processor on 1 CPU.
>
> For floating point that's like 250 Gflops or so practically at their
> disposal.
The Cell will not automagically give you 250 Gflops. It will not be
easy to program.
>
> That *really* will make the networks the weakest link in the chain.
I haven't looked at the design in detail, but it looks like you are
going to need a multistage resource scheduler to handle streaming data
into the cell. Think of it as a super-multi-core NPU that has a more
general instruction set.
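To make concrete the kind of scheduling I mean, here is a
double-buffering sketch (entirely my own illustration; dma_get,
dma_wait and compute are made-up placeholders, not anything from IBM,
and in this stub they just copy synchronously so it runs anywhere):
fetch the next chunk of data while computing on the current one, so
the processing units never stall on main memory.

#include <string.h>

#define CHUNK 16384

static float bufA[CHUNK], bufB[CHUNK];

/* Hypothetical placeholders for asynchronous DMA into a local store;
   here they degenerate to a plain memcpy so the sketch is runnable. */
static void dma_get(float *dst, const float *src, long n)
{
    memcpy(dst, src, n * sizeof(float));
}

static void dma_wait(float *buf)
{
    (void) buf;                 /* nothing to wait for in this stub */
}

static void compute(float *buf, long n)
{
    long i;                     /* placeholder kernel: scale in place */
    for (i = 0; i < n; i++)
        buf[i] *= 2.0f;
}

/* Double buffering: start the fetch of chunk i+1 while computing on
   chunk i, then swap buffers. A real scheduler for a Cell-like part
   would do this per compute unit, plus decide who gets which chunk. */
void stream_process(const float *main_mem, long nchunks)
{
    float *cur = bufA, *next = bufB, *tmp;
    long i;

    dma_get(cur, &main_mem[0], CHUNK);
    for (i = 0; i < nchunks; i++) {
        dma_wait(cur);
        if (i + 1 < nchunks)
            dma_get(next, &main_mem[(i + 1) * CHUNK], CHUNK);
        compute(cur, CHUNK);
        tmp = cur; cur = next; next = tmp;
    }
}

Getting a compiler (or programmer) to generate that overlap
automatically, across 8 units, is exactly the part I expect to be hard.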
The Itanium has been out for a while and compilers for it are still
maturing. VLIW^H^H^H^HEPIC is hard. Anyone remember the Multiflow
Trace machines? I would not expect to see a gcc for the Cell (and now
watch IBM make me eat my words). I would expect that programming it is
going to be a challenge.
[...]
> So obviously the cell processor is kind of a step back for such
> software, but even then we can see a single cell at 4.0 GHz performing
> probably like an 8-processor 2.8 GHz Xeon MP machine.
Again, this is going to be difficult to program for in all likelihood
(and if there are IBMers out there with this hardware who know I am
wrong, please let me know, or even better, let me at it :) ). Good
compilers are hard. Very good compilers are rare. Suboptimal compilers
are the norm.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 734 786 8452
cell : +1 734 612 4615