[Beowulf] SC|05: will 10Ge beat IB?
Patrick Geoffray
patrick at myri.com
Sun Nov 27 05:31:04 PST 2005
Hi Gary,
Gary Green wrote:
> Jeff points out one of the two issues slowing down adoption of 10GigE in
> HPC. The first bottleneck is the lack of a cheap high port count
> non-blocking 10GigE switch.
Add to that "wormhole" and "level-2 only". The Ethernet spec pretty much
requires a store-and-forward model to implement the spanning tree, but I
was pleasantly surprised to hear that most existing Ethernet switch
vendors (in addition to the new kids on the block) can deliver wormhole
capability. However, the cost of existing switches is often tied to
IP-routing functionality, and a true level-2-only switch could shed much
of that cost.
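
To put rough numbers on the store-and-forward vs wormhole point, here is
a small back-of-the-envelope in C; the frame and header sizes are
illustrative assumptions, not measurements of any particular switch:

    /* Rough illustration: why store-and-forward hurts at 10 Gb/s.
     * A store-and-forward switch must receive the whole frame before
     * forwarding it, so every hop adds at least frame_size / line_rate.
     * A wormhole (cut-through) switch forwards as soon as the header
     * has been decoded. Figures are for illustration only. */
    #include <stdio.h>

    int main(void)
    {
        double line_rate = 10e9;      /* bits/s                            */
        double frame     = 1500 * 8;  /* full-size Ethernet frame, bits    */
        double header    = 64 * 8;    /* roughly what cut-through waits for */

        printf("store-and-forward per hop: %.0f ns\n", frame  / line_rate * 1e9);
        printf("cut-through per hop:       %.0f ns\n", header / line_rate * 1e9);
        return 0;
    }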
> The second issue is the lack of a broadly adopted RDMA implementation. It
> appears that there is movement forward in this arena as well with the move
> to adopt the verbs interface from the OpenIB group.
I think the RDMA hype is finally losing steam. People realize that, at
least for MPI, RDMA does not help. Using memory copies (i.e. not doing
zero-copy, i.e. not doing RDMA) is faster for small/medium messages,
which represent the vast majority of messages in HPC. Furthermore, an
RDMA-only design comes with its share of problems (memory scalability is
a major one) that cannot be ignored much longer.
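
As a concrete illustration of why copies win for small messages, here is
a minimal C sketch of the eager-vs-rendezvous decision most MPI
implementations make; the nic_* calls and the 16 KB threshold are
hypothetical stand-ins (stubbed so the sketch compiles), not a real NIC
or MPI API:

    #include <stdio.h>
    #include <string.h>
    #include <stddef.h>

    #define EAGER_THRESHOLD 16384            /* crossover point, NIC-dependent */

    static char bounce_buf[EAGER_THRESHOLD]; /* pre-registered at startup */

    static void nic_send(const void *p, size_t n)     { (void)p; printf("send %zu bytes from bounce buffer\n", n); }
    static void nic_register(const void *p, size_t n) { (void)p; printf("pin %zu bytes (expensive)\n", n); }
    static void nic_rdma_put(const void *p, size_t n) { (void)p; printf("handshake, then RDMA %zu bytes\n", n); }

    static void mpi_like_send(const void *buf, size_t len)
    {
        if (len <= EAGER_THRESHOLD) {
            /* Eager path: one memcpy into a pre-registered bounce buffer,
             * then send right away.  No page pinning, no round-trip, which
             * is why copies win for the small/medium messages that
             * dominate HPC traffic. */
            memcpy(bounce_buf, buf, len);
            nic_send(bounce_buf, len);
        } else {
            /* Rendezvous path: pin the user buffer and wait for the
             * receiver to expose a destination before the zero-copy RDMA
             * starts.  The registration and the extra round-trip only pay
             * off for large messages. */
            nic_register(buf, len);
            nic_rdma_put(buf, len);
        }
    }

    int main(void)
    {
        static char big[1 << 20];
        mpi_like_send("hello", 6);       /* takes the eager (copy) path  */
        mpi_like_send(big, sizeof big);  /* takes the rendezvous path    */
        return 0;
    }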
With system call overhead below 0.3us these days, OS-bypass may
eventually join zero-copy/RDMA in the list of features
once-useful-but-not-so-much-anymore.
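
If you want to check the system call figure on your own box, a minimal
Linux-only C timing loop like the one below gives the order of magnitude;
exact numbers depend on the kernel and CPU:

    /* Time a cheap system call (getpid via syscall(2), which is not
     * cached in user space) to see the per-call overhead. */
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);              /* force a real kernel entry */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per system call\n", ns / iters);
        return 0;
    }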
> But as someone pointed out in a previous email, just as 10GigE will surely
> catch up to where IB is today, by that time, IB will be at the next
> generation with DDR, QDR, 12X, etc, etc, etc...
With DDR, you reduce your copper cable length by half, and even more with
QDR. How will you connect hundreds of nodes with 5 m fire-hose cables?
Today's fibers can carry 10 Gb/s of data (12.5 Gb/s signal). You can push
40 Gb/s of data with very expensive optics, but it makes sense only for
inter-switch links. Furthermore, which IO bus would you use? With a
PCI-Express 8x slot, you can barely saturate a 10 Gb/s data link (for the
curious, PCI-Express efficiency is not great: the maximum payload per
transaction, its "MTU", is usually 64 bytes).
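
A rough worked example of that efficiency argument, with the per-TLP
overhead (framing, sequence number, header, LCRC: roughly 24 bytes)
assumed for illustration:

    /* Back-of-the-envelope: effective bandwidth of a PCI-Express x8
     * (Gen1) slot when the NIC uses 64-byte TLP payloads.  The overhead
     * figure is an assumption for illustration, not a vendor number. */
    #include <stdio.h>

    int main(void)
    {
        double lane_rate = 2.5e9;       /* Gen1 signalling, bits/s/lane */
        double encoding  = 8.0 / 10.0;  /* 8b/10b line coding           */
        int    lanes     = 8;
        double payload   = 64.0;        /* bytes per TLP (typical MTU)  */
        double overhead  = 24.0;        /* bytes per TLP, assumed       */

        double raw = lane_rate * encoding * lanes;           /* ~16 Gb/s   */
        double eff = raw * payload / (payload + overhead);   /* ~11.6 Gb/s */

        printf("raw data rate: %.1f Gb/s, effective: %.1f Gb/s\n",
               raw / 1e9, eff / 1e9);
        return 0;
    }

That lands only a little above 10 Gb/s, hence "barely".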
The only way to feed that much data is to sit on the motherboard, through
an HT connection or directly on the memory bus. That is not commodity
anymore, and nobody makes money in the custom-motherboard market. And in
the end, you will realize that the extra bandwidth buys you little.
> There is also the question as to whether GigE will be able to demonstrate
> the ultra low latencies seen in high performance interconnects such as
> Quadrics, InfiniBand and Myrinet. In the end, there will most likely remain
> a market for high performance interconnects for high end applications.
If Ethernet switch latency comes down to the 200 ns range, pure Ethernet
will be in the same latency range as everything else.
Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com