Any news on InfiniBand?

Patrick Geoffray patrick at myri.com
Wed Feb 26 22:50:25 PST 2003


On Thu, 2003-02-27 at 00:46, Anthony Skjellum wrote:
> Patrick, aren't you from a competitor vendor? :-)

Of course, and all of my statements have to be taken in this context :-)

> On 26 Feb 2003, Patrick Geoffray wrote:

> > > real applications), we did provide this number.  We have found almost universal
> > > advantage to avoid polling MPI in almost all networks, including Ethernet+TCP,
> > > Giganet, and Myrinet.

This was the source of my comment: your universal advantage in avoiding
polling comes from your MPI implementation; it's a software design
choice. That does not mean blocking is the universal solution for
message completion.
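
To make the distinction concrete, here is a minimal sketch in C of the
two completion styles; it is illustrative MPI, not either vendor's
implementation:

#include <mpi.h>

/* Polling spins on MPI_Test and reacts fast at the cost of a busy CPU;
 * blocking calls MPI_Wait and lets the implementation put the caller
 * to sleep. */
void complete_by_polling(MPI_Request *req)
{
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
        /* spare cycles here can overlap computation with the transfer */
    }
}

void complete_by_blocking(MPI_Request *req)
{
    /* an implementation designed around blocking can yield the CPU here */
    MPI_Wait(req, MPI_STATUS_IGNORE);
}

Which style wins depends on whether the application has useful work to
overlap, which is exactly why neither choice is universal.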

> If you have hardware progress of MPI level, you don't use software to do it.
> I am only aware that Cplant Portals has this in their MCP...

It's interesting that hardware providing such capabilities has existed
for some time now, and nobody ever really took full advantage of it. I
think that's because the people doing the low-level stuff were not in
touch with the people doing MPI. It's also interesting that this is
changing now, maybe under the pressure of IB.

> > One of the flaws of IB is to use a paradigm built on VI. MPI did not map
> > well on VI, and I expect the same thing for IB.
> We had good luck with MPI/Pro over VI

I think the fundamental advantage of MPI/Pro was to tackle the
progression problem and try to improve communication/computation
overlap. I don't think VI provided any specific features to make it
better:
* VI is connection-oriented, which means scalability issues and high
setup overhead.
* It is based on large descriptors, which means high latency: the
latency on Giganet was far higher than it should have been for a pure
silicon solution.
* Memory registration is explicit, which means it was not optimized,
and it was a nightmare to move out of the critical path (see the
registration-cache sketch after this list).
* The matching space is too small for MPI, so you need a progression
engine on the host.
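
As an example of the registration point, here is a hypothetical sketch
of the cache an MPI typically bolts on top of an explicit-registration
API; the type and function names (reg_entry, vi_register_memory) are
illustrative, not real VIPL calls:

#include <stddef.h>

#define CACHE_SLOTS 64

typedef struct {
    void  *base;     /* start of the registered region */
    size_t len;      /* registered length */
    void  *handle;   /* opaque handle from the interconnect */
} reg_entry;

static reg_entry cache[CACHE_SLOTS];
static int cache_used;

extern void *vi_register_memory(void *base, size_t len); /* assumed */

void *lookup_or_register(void *base, size_t len)
{
    for (int i = 0; i < cache_used; i++)
        if (cache[i].base == base && cache[i].len >= len)
            return cache[i].handle;            /* hit: no pinning cost */

    /* miss: the expensive path we want out of the critical path */
    void *h = vi_register_memory(base, len);
    if (cache_used < CACHE_SLOTS)
        cache[cache_used++] = (reg_entry){ base, len, h };
    return h;
}

A real cache must also notice when the application frees or unmaps a
cached buffer and invalidate the entry, which is precisely the
nightmare part.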

> general statement suggests.  In fact, Dell did independent comparisons
> of Myrinet and Giganet and found much lower overheads for Giganet than with

There is no such thing as "independent comparisons", especially when
Dell and Giganet had a distribution deal. It's called marketing.

> GM at the time... it was quite a good technology at scales up to 128.

The good part of Giganet was, IMHO, the packet engine built on top of
an ATM chip, with a very good medium-message pipeline. VI was in
fashion at the time, and it sure looked appealing to jump on the
bandwagon initially pushed by Microsoft, Compaq and Oracle. There was
the same appeal for IB a few years ago. But let me remind you of the
current state of VI: dead in the water.

> I'd recommend the papers of Jenwei Hsieh and Tau Leng to everyone, to look at
> how, even with the need for a progress thread, large Giganet transfers
> were only using 3% or so of CPU, whereas similar Myrinet was in the 20%+
> range (as far as I recall, but see the white papers).

This is the specific advantage of the progression thread. It is an MPI
implementation trade-off to deal with GM's constraints (which are
every bit as bad as VI's, BTW).
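
For readers who haven't seen one, here is an illustrative sketch of
such a progression thread, assuming an MPI initialized with
MPI_THREAD_MULTIPLE; it is not MPI/Pro's actual code:

#include <mpi.h>
#include <pthread.h>
#include <sched.h>

/* A helper thread keeps driving message completion so a large
 * transfer advances while the main thread computes. */
void *progress_loop(void *arg)
{
    MPI_Request *req = arg;
    int done = 0;
    while (!done) {
        MPI_Test(req, &done, MPI_STATUS_IGNORE); /* advance the engine */
        sched_yield(); /* hand cycles back to the compute thread */
    }
    return NULL;
}

/* usage (error checking elided):
 *   pthread_t t;
 *   MPI_Request req;
 *   MPI_Irecv(buf, n, MPI_BYTE, src, tag, MPI_COMM_WORLD, &req);
 *   pthread_create(&t, NULL, progress_loop, &req);
 *   compute();                  // overlapped with the transfer
 *   pthread_join(t, NULL);
 */

The transfer keeps moving while the application computes, which is how
low host-CPU numbers like the ones quoted above are achieved; the cost
is a thread spent yielding and polling.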
 
> What evidence can you offer that MPI doesn't work well over VI?

VI, GM and IP don't work well for MPI. They do not share MPI's
semantics, and MPI implementations built on top of them use a lot of
code just to work around them. If you have to add a progression
thread, or cache the memory registrations, or do the
rendezvous/matching yourself (see the sketch below), the layer was not
designed for MPI.
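
As a concrete example of that last point, here is a hypothetical
sketch of the host-side rendezvous logic an MPI has to implement
itself when the transport offers no tag matching; every type and
helper below is illustrative, not a real API:

#include <stddef.h>

enum { MSG_RTS, MSG_CTS };

typedef struct {
    int    type;  /* MSG_RTS (request to send) or MSG_CTS (clear to send) */
    int    src;   /* sending rank */
    int    tag;   /* MPI tag, matched in software on the host */
    size_t len;   /* payload size */
} ctrl_msg;

/* assumed helpers: posted-receive lookup, control send, data send */
extern int  find_posted_recv(int src, int tag);
extern void send_ctrl(int dst, int type, int tag);
extern void send_payload(int dst, int tag);
extern void queue_unexpected(const ctrl_msg *m);

void on_ctrl_msg(const ctrl_msg *m)
{
    switch (m->type) {
    case MSG_RTS:                        /* sender announced a message */
        if (find_posted_recv(m->src, m->tag))
            send_ctrl(m->src, MSG_CTS, m->tag); /* buffer ready: go */
        else
            queue_unexpected(m);         /* park until a recv is posted */
        break;
    case MSG_CTS:                        /* receiver is ready: push data */
        send_payload(m->src, m->tag);
        break;
    }
}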

> As for IB, there are reliable connections, but also RD, which is quite
> interesting to look at ... The connection you draw is a confusing one.

I will add IB to the list of communication layers not designed for MPI
(VI, GM and IP). The IB trade association didn't think for one second
about MPI when writing its huge specs. Having Reliable Datagram won't
help one bit in writing more efficient MPI middleware.

> In fact, the truth is that IB will work very well for small and medium scale
> (maybe to 1000 nodes), before suffering problems with connections and other
> issues.  It will probably be quite convenient to use, easy to multiprogram,
> and offer a lot more robustness than you can get with a weak NIC.

You don't want to hear my truth about IB, because I am completely
biased and it won't be pretty. Specifically for MPI, IB will work as
well as VI or GM did, which is to say on three legs with two iron
balls chained to each, walking backward.
 
> The picture of the technologies is much more interesting than your mail
> suggests, and not pointing all for one technology or the other.

For the HPC world, the de facto communication standard is MPI. To
support MPI effectively, you need native support from the hardware,
either in silicon or in firmware. There are two solutions today that
provide such support: Quadrics and Myrinet. When everybody is soon at
4X (which is, IMHO, overkill for the current state of PCI and
machines), the difference for the HPC community will come down to the
efficiency of MPI. On that score, I think IB is like VI: it sucks :-)

Patrick
-- 

Patrick Geoffray, PhD
Myricom, Inc.
http://www.myri.com



