[Beowulf] Intel pulls networking onto Xeon Phi
atchley at tds.net
Wed Dec 4 07:49:15 PST 2013
On Tue, Dec 3, 2013 at 10:45 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Mon, Dec 02, 2013 at 08:41:26AM -0500, atchley tds.net wrote:
> > On Mon, Dec 2, 2013 at 8:37 AM, atchley tds.net <atchley at tds.net> wrote:
> > > I am not sure what Aries currently offers that IB does not.
> The IB in question is the True Scale adapter, which does some things
> really fast and other things pretty slowly. Aries has different
> features (quite different and more capable than IB, really), and is
> much larger.
> To put this into perspective, I suspect the typical modern Ethernet
> adapter has more gates than True Scale. If you're going to add
> something to a CPU, it had best be small. CPU guys get really irate if
> you reduce their yield.
Given that the article said that it could borrow things not found in
Ethernet or IB controllers, I don't think it was meant to be TS only.
What Aries features do you have in mind that are different and/or more
capable? Are they expressed by the uGNI interface? I find the uGNI
interface to be a strict subset of IB and the hardware has some interesting
> > > As Myricom showed with MX over Ethernet followed by
> > > Mellanox with RoCE, you can get low latency over Ethernet bypassing the
> > > kernel and the TCP stack.
> Indeed, Myricom+MX is quite similar in concept to the IB extension
> found in True Scale. The main difference (in my mind) is that OpenMX
> is hosted on a not-optimized-for-MX generic ethernet adapter, and that
> Myrinet's hardware was not fully optimized to do exactly what MX
> needed, nothing more and nothing less.
Hmm, I was not speaking of OpenMX. It was a minor change to run the native
MX on Myricom's NICs in Ethernet mode. The latency was no different than
running MX on Myrinet until you went through a switch.
OpenMX is API, binary, and wire compatible with MX such that one could run
MX on a Myricom 10G NIC connected to an Ethernet switch and OpenMX on a
non-Myricom 10G Ethernet NIC connected to the same switch. Because OpenMX
does not require specialized hardware, it sits atop the Ethernet driver in
the kernel. It is not kernel-bypass, but it does bypass the full IP stack.
> True Scale is the smallest
> possible adapter that supports the basics of what MPI needs.
> The fabric is pretty irrelevant, as long as it has flexible
> routing. (See below for comments about SDN.)
> > > HPC sends a lot of small messages and various stacks are making use of
> > > 8-byte atomics. It is unhelpful to have a 64 byte minimum frame size in
> > > this case.
> Yes, a smaller frame size is quite nice for achieving high message
> rates for tiny packets.
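To put rough numbers on the small-message point, here is a back-of-the-envelope sketch. The thread only mentions the 64-byte minimum frame; the 8-byte preamble/SFD and 12-byte inter-frame gap are the standard Ethernet per-frame overheads, and the function names are made up for illustration. It also ignores whatever protocol header the atomic itself would need, so the real efficiency is lower still.

```python
# Assumed per-frame overheads for standard Ethernet, in bytes.
PREAMBLE_SFD = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP = 12   # mandatory idle time between frames
MIN_FRAME = 64        # minimum frame: header + padded payload + FCS


def wire_bytes(frame_bytes: int = MIN_FRAME) -> int:
    """Bytes of link time one frame occupies; short frames are padded up."""
    return PREAMBLE_SFD + max(frame_bytes, MIN_FRAME) + INTERFRAME_GAP


def payload_efficiency(payload_bytes: int, frame_bytes: int = MIN_FRAME) -> float:
    """Fraction of link time that carries useful payload."""
    return payload_bytes / wire_bytes(frame_bytes)


def max_frame_rate(link_bits_per_sec: float, frame_bytes: int = MIN_FRAME) -> float:
    """Upper bound on frames per second for back-to-back frames."""
    return link_bits_per_sec / (wire_bytes(frame_bytes) * 8)


if __name__ == "__main__":
    print(f"wire bytes per minimum frame: {wire_bytes()}")
    print(f"efficiency for an 8-byte atomic: {payload_efficiency(8):.1%}")
    print(f"max frames/s at 10 Gb/s: {max_frame_rate(10e9):,.0f}")
```

Under these assumptions a minimum frame takes 84 bytes of link time, so an 8-byte operand uses under 10% of it, and a 10 Gb/s link tops out a little below 15 million frames per second no matter how small the message is.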
> > > Ethernet topology discovery protocols were designed for environments
> > > where equipment can be changed out, expanded, or otherwise altered.
> This has changed in the new SDN (software defined networking)
> world. You can think of SDN on Ethernet as Infiniband management
> protocols implemented in ethernet, making many of the same mistakes
> that Infiniband did, plus some new ones.
SDN is the Big Data of networking. It can be anything you want. ;-)
> > Ethernet requires a single-path between any two endpoints.
> This is not true. It's more accurate to say that ethernet (especially
> TCP) benefits from in-order delivery, which you can ensure either on a
> host-host basis (which is what spanning tree provides) or on a
> per-flow basis (which is what SDN allows.)
> Personally, I'm a bit bummed this won't happen until 2015 :-( but I'm
> really excited to see True Scale's basic design continue into another
You are confusing a transport layer with the L2 layer. Ethernet does not
care about order. It does not care about reliability. These are provided at
SDN has many uses. You can provide per-flow routing, but it can do much
I thought most SDN efforts were to provide routing between fabrics.
Ethernet, by definition, is non-routing and is a common broadcast domain. I
thought STP et al. took care of multiple paths to ensure a single path between
any two MACs. Is it possible for one host to send a broadcast and another
host to receive multiple copies?
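As a rough illustration of the per-host versus per-flow distinction discussed above, here is a minimal sketch of hash-based path selection. It is not how any particular switch or SDN controller does it; the function name, the tuple layouts, and the choice of CRC32 are all assumptions made for the example. The idea is only that every packet carrying the same key lands on the same path, so ordering is preserved within that key while different keys can spread across multiple paths.

```python
import zlib


def pick_path(key_fields: tuple, num_paths: int) -> int:
    """Map a key (host pair or flow tuple) to one of num_paths paths.

    Hypothetical helper: the same key always yields the same path index,
    which keeps packets sharing that key in order.
    """
    if num_paths < 1:
        raise ValueError("num_paths must be at least 1")
    key = "|".join(str(field) for field in key_fields).encode()
    return zlib.crc32(key) % num_paths


if __name__ == "__main__":
    paths = 4

    # Per-host basis: key on the MAC pair only, so all traffic between
    # two hosts shares one path.
    host_pair = ("00:11:22:33:44:55", "66:77:88:99:aa:bb")
    print("host pair ->", pick_path(host_pair, paths))

    # Per-flow basis: add transport fields, so separate flows between the
    # same two hosts may take different paths.
    flow_a = host_pair + ("tcp", 40000, 5001)
    flow_b = host_pair + ("tcp", 40001, 5001)
    print("flow a    ->", pick_path(flow_a, paths))
    print("flow b    ->", pick_path(flow_b, paths))
```

With a single path (num_paths of 1) this collapses to the spanning-tree picture of one path between any two MACs; with more paths, in-order delivery holds only per key, which is all a transport like TCP needs.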