low-latency high-bandwidth OS bypass user-level messaging for commodity(linux) clusters with commodity NICs(<$200), HELP! (GAMMA/EMP/M-VIA/etc.)

Donald Becker becker at scyld.com
Mon Dec 16 16:31:40 PST 2002

On Mon, 16 Dec 2002, jon wrote:

> Perhaps this isn't the best way to get ahold of you, but I've also sent
> this to the Beowulf list.  I've noted your comments on OS Bypass drivers
> in the past.  But isn't there some room for non-TCP/IP related traffic,
> such as with computing clusters?  We don't need no stinking TCP!  No
> associated revenue?  You could replace Myrinet in the thousands of nodes
> we have here ALONE at NCSA.

Very likely not.  Myrinet has both a cost and increasing performance
advantage over gigabit Ethernet when the switch is larger than about 96

> We (UIUC theoretical astrophysics group) are in the midst of purchasing
> a $50K cluster (I know, small, but big for us! :)) and I've done all the
> research as to what we should be getting.  We ended up going with an
> Intel Desktop gigabit board and P4, but have found the tests to be very
> poor.  We only have 4 nodes right now because we worried about this very
> thing.

Latency or bandwidth?  And what are you using to test?

> InfiniBand

Hardware is just now appearing, after a rapid committee-driven
complexity increase.  The initial price is well above Myrinet to capture
the value to the "must-have" crowd.  The question is whether there is the
motivation to travel down the price-volume curve.

> Giganet using VIA
> ServernetII using VIA

Both effectively dead hardware products, although Giganet is still
shipping its current hardware.

> U-Net
   Dead software effort, pre-dated VIA protocol.

   Magic protocol using custom Myrinet firmware.  Grumble: early
   performance numbers were not reproducible (I got exactly 50% of tech report
   numbers on same hardware and software).
> PM, FM
  The other Myrinet custom protocols?  Dead.
  With a communication processor to do the work at the other end, you
  can do magic application-specific things.  And when you write the
  paper, the programming effort was minimal and the performance

  The only thing I knew of by this name was a predecessor to IPMI.
  A google search shows only a low-speed serial communication project.

> Half of these are seemingly dead, those that seem relatively alive are:

Only half?  Do you have the same number of fingers on both hands?

A general guideline is that building a safe, reliable, general purpose
communication protocol is always much more difficult than getting
something that only works in the perfect conditions.  You must implement
checksums, sequence numbers, and recovery for failed endpoints.
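
To make the first two requirements concrete, here is a minimal sketch of a
frame format carrying a sequence number and a checksum, with a receiver that
rejects corrupt, duplicate, and out-of-order frames.  It does not correspond
to GAMMA, EMP, M-VIA, or any other protocol mentioned in this thread: the
header layout and the names `pack_frame` and `Receiver` are hypothetical,
chosen only for illustration.  It also leaves out retransmission, sequence
wraparound, and recovery for failed endpoints, which is where much of the
real effort goes.

```python
import struct
import zlib

# Hypothetical header: 32-bit sequence number, 16-bit payload length,
# 32-bit CRC, all in network byte order (10 bytes, no padding).
HEADER = struct.Struct("!IHI")


def pack_frame(seq, payload):
    """Prefix payload with a sequence number, length, and CRC32 over all three."""
    crc = zlib.crc32(struct.pack("!IH", seq, len(payload)) + payload)
    return HEADER.pack(seq, len(payload), crc) + payload


class Receiver:
    """Delivers frames strictly in order; everything else is reported back."""

    def __init__(self):
        self.expected = 0      # next sequence number we will accept
        self.delivered = []    # payloads handed to the application

    def receive(self, frame):
        if len(frame) < HEADER.size:
            return "corrupt"
        seq, length, crc = HEADER.unpack_from(frame)
        payload = frame[HEADER.size:]
        if len(payload) != length:
            return "corrupt"
        if zlib.crc32(struct.pack("!IH", seq, length) + payload) != crc:
            return "corrupt"
        if seq < self.expected:
            return "duplicate"   # already delivered; sender may have retried
        if seq > self.expected:
            return "gap"         # an earlier frame was lost; needs a resend
        self.delivered.append(payload)
        self.expected += 1
        return "ok"
```

Even this toy version adds a header, a CRC pass over every payload, and
per-connection state on the receiver, before any retransmission logic exists.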

If you are directly writing to a remote process memory space, you have
to take into account VM page table tracking and cache coherency.  These
can quickly erase any performance advantage of "zero copy".

> M-VIA: http://www.nersc.gov/research/ftg/via/
> Only support a few devices, and only 1 expensive ($500) gigabit board
> that's still available (the SysKonnect).

> GAMMA: http://www.disi.unige.it/project/gamma/index.html

The top project for current support.

> Depending on what part of their website you are at, they support
> different devices.  The Alteon TIGON-II results seem to suck for
> latency, which is our biggest problem.  The Netgear GA621 looks great!
> But we already bought a $5000 copper gigabit switch!  We are stuck with
> it! (HP Procurve 5308xl).  Whether they support the GA622 is kinda open
> or at least untested according to the website.  No luck getting in touch
> with driver writer about that.  Besides, EMP guy says the GA622 sucks!

While there are better Gigabit chips than the DP83820, most of its bad
reputation comes from the poor performance of the other drivers out
there.  We get quite reasonable performance from it with the Scyld
ns820.c driver.  Others have reported a 2.5-3X performance improvement
over the driver written by Red Hat.

> EMP: http://www.osc.edu/~pw/emp/
> Seems to be interesting, although the available 3Com 3C996, of which we
> have 3 to test, is only said to be "maybe" supported since it's Tigon 3
> and not Tigon 2.  And will it suck in latency just like the Tigon 2?
> EMP guy says the GA-622T sucks with its ns chipset and that was one
> option with GAMMA, assuming he really did write the driver for both 621
> and 622 (their website isn't clear about this, and no emails from the
> guys there), since the 622 "was" an option.
> My questions are:
> 1) Is there a commercial product for a not so expensive board that
> provides what GAMMA/EMP/M-VIA provide?  Any other OS-bypass driver/MPI
> layer I don't know about?

No commercial company is likely to support a communication protocol
unless they can pay for it (and have a hope of it working!) by bundling
it with expensive hardware.  We would support something on a best-effort
or time-and-materials basis.

> 2) Is there a solution I'm missing?  Has to be copper gigabit for linux,
> OS-bypass like GAMMA, MPI on top of that GAMMA-like.  No dead boards,
> etc.  Why are there no commercial products?  MPI/Pro is just a funny MPI
> still on top of TCP, no?

With custom versions available -- whatever you are willing to pay for.

> Honestly, I can't really figure out what Scyld does.  Is it just a linux
> distribution?  Does it actually have OS-bypass networking?  Does
> anything?

We are a Linux distribution specifically designed for clustering.  We
have various modifications for higher network performance, but that's
actually used against us!  Our competitors say "Look, Scyld modifies the
kernel while we ship you a completely standard system."  Then, when
things don't work (as so often happens with complex systems), they get to
say "that's the standard Linux behavior, it's not our problem".

> Why is the OS-bypass so hard?  If wanting no TCP support, isn't it
> easier than writing standard linux driver? (like you've done a lot!)

It took 20 years to get TCP right...

Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993
