The FNN (Flat Neighborhood Network) paradox

Wed Feb 21 10:11:26 PST 2001

On Wed, 21 Feb 2001, Miroslaw Gawenda wrote:

> Thanks for great idea - I will try to check this.
>
> I will try to prepare first 4 node beowulf tonight :)
> I will put to every node 3 nics and I connect this without any switch.
> After this I will put in every node only one nic and I compare the
> benchmark results.

If I understand you, you are planning to build a hypercube.  I've played
briefly with hypercubes, and concluded that (like the FNN) it is an idea
that makes sense only in a market where switches are very expensive
compared to NICs.

Remember, even for a tetrahedral four node beowulf, you must spend
perhaps $20 (or more, of course) per NIC, or $60 per node.  Four nodes
require 12 NICs and six cables for a total cost of around $270 or more.
Yes there are cheaper NICs available (e.g. RTL8139's) but I've had
terrible luck with these in the one hypercube I tried to build with them
and still have a whole stack of them sitting in my junk hardware pile as
a consequence.

A five port switch costs perhaps $70 (or less if you shop hard) -- eight
port switches are as little as $80.  A switched port costs LESS than a
NIC these days.  Admittedly these switches are likely to be
store-and-forward with mediocre latency, but even better switches aren't
that expensive anymore.  Add in only FOUR NICs and cables @$25 each, and
you can get effortless connections for only $170 and have an extra port
to connect up a head node or to another switch.
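
As a rough check on the arithmetic above, here is a small sketch that
reproduces both totals.  The function names are made up for the
illustration, and the $5 per cable in the switchless case is an
assumption (it is what makes 12 NICs at $20 plus six cables come to
$270); the other prices are the ones quoted above.

```python
def full_mesh_cost(nodes, nic_price=20.0, cable_price=5.0):
    """Cost of wiring every node directly to every other node (no switch).

    cable_price is an assumed figure, chosen so that four nodes come
    out at the $270 quoted in the text.
    """
    nics = nodes * (nodes - 1)          # one NIC per neighbour, per node
    cables = nodes * (nodes - 1) // 2   # one cable per pair of nodes
    return nics * nic_price + cables * cable_price


def switched_cost(nodes, switch_price=70.0, nic_and_cable_price=25.0):
    """Cost of one switch plus a single NIC and cable per node."""
    return switch_price + nodes * nic_and_cable_price


if __name__ == "__main__":
    for n in (4, 5, 8):
        print(n, full_mesh_cost(n), switched_cost(n))
```

The switchless cost grows with the square of the node count while the
switched cost grows linearly, so the gap only widens past four nodes
(ignoring the need for a bigger switch, which the sketch does not model).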

The other things to consider are:

   a MUCH more complicated topology.  Routing tables have to be built to
manage each node's path to the other hypercubical nodes.
   a MUCH higher latency if you go beyond the number of NICs your
PCI bus can hold (also you have to turn on real routing and build really
complicated routing tables).
   a MUCH greater human cost to build it and maintain it (see
"complicated" in the previous two entries).
   finally, in order to get the advantage of possibly aggregate
bisection bandwidth, you mustn't be blocked at e.g. the kernel level so
that you are effectively only using one NIC at a time anyway.  Expensive
NICs may use DMA and a carefully written application (one with
nonblocking I/O, for example) may then allow you to get some advantage
in terms of aggregate bandwidth, but cheap NICs or careless applications
probably won't.
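
To give a feel for the routing table point, here is a minimal sketch
for a true binary hypercube, assuming nodes are numbered 0..2^d-1 and
NIC k on each node is cabled to the node whose number differs in bit k.
That numbering, and the function name, are assumptions made for the
illustration; this is not what any particular routing daemon does.

```python
def hypercube_routes(dim):
    """Build a next-hop table for a dim-dimensional binary hypercube.

    Returns {src: {dst: (nic, next_hop, hops)}} where nic is the index
    of the interface to send on, next_hop is the neighbour reached
    through it, and hops is the total path length to dst.
    """
    nodes = 1 << dim
    table = {}
    for src in range(nodes):
        routes = {}
        for dst in range(nodes):
            if dst == src:
                continue
            diff = src ^ dst                 # bits that still differ
            lowest = diff & -diff            # fix the lowest differing bit first
            nic = lowest.bit_length() - 1    # which interface that bit maps to
            hops = bin(diff).count("1")      # Hamming distance = path length
            routes[dst] = (nic, src ^ lowest, hops)
        table[src] = routes
    return table


if __name__ == "__main__":
    for dst, route in sorted(hypercube_routes(3)[0].items()):
        print("0 ->", dst, route)
```

Even in this tidy case every node carries its own table, and any
destination more than one hop away has to be forwarded by an
intermediate node, which is where the extra latency comes from.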

In my own experimental hypercube, aggregate internode performance was
actually measurably worse than on a switch, and attempting to talk on
all channels in parallel actually destabilized the kernel of that day
and caused system crashes (early 2.2.x's).  Which made average
internode performance REALLY bad when the time spent recovering from
crashes was taken into account.  This could likely all have been
resolved -- with a lot of work.  Instead I went out and bought a (then)
$220 8 port switch and never looked back.

In conclusion, one is as likely as not to get WORSE overall performance
(unless one works very hard to tune up or uses a package like the
channel bonding package where others have done the work for you), to
work MUCH harder (which is a real cost), and to pay a lot more (which is
also a real cost).  Higher cost, less benefit.

The FNN solution shares some of the features of the hypercube -- if one
has to buy the switches it is more expensive unless you're talking about
a really big flat network.  One has to manage routing tables so
complicated that only an optimization (e.g. simulated annealing,
genetic) algorithm is capable of building them -- they are analogous to
solving the N Queens problem in chess (in fact they are probably
equivalent to the inverse of the N Queens problem or something like
that).
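
The constraint such a search has to satisfy is easy to state even if it
is hard to solve: every pair of nodes must share at least one switch,
and no switch may run out of ports.  Here is a small sketch of the
check an optimizer would use as its cost function.  The wiring format
(node name mapped to a set of switch numbers) and the function names
are assumptions for the illustration, not the FNN authors' code.

```python
from itertools import combinations


def uncovered_pairs(wiring):
    """Return the node pairs that have no switch in common.

    wiring maps each node name to the set of switches its NICs plug into.
    An empty result means the layout is a valid flat neighborhood.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(wiring), 2)
        if not (wiring[a] & wiring[b])
    ]


def switch_loads(wiring):
    """Count how many ports each switch needs under this wiring."""
    loads = {}
    for switches in wiring.values():
        for switch in switches:
            loads[switch] = loads.get(switch, 0) + 1
    return loads


if __name__ == "__main__":
    # Six nodes, two NICs each, three four-port switches.
    wiring = {
        "A": {0, 1}, "B": {0, 1},
        "C": {0, 2}, "D": {0, 2},
        "E": {1, 2}, "F": {1, 2},
    }
    print(uncovered_pairs(wiring), switch_loads(wiring))
```

An annealing or genetic search would shuffle the wiring and try to
drive the number of uncovered pairs to zero while keeping every load
within the port count.  At six nodes it can be done by hand; at a few
hundred it cannot.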

One still has to worry about the kernel's ability to manage network
transactions on 3 channels as efficiently as on one.  It may or may not
lead to greater aggregate bisection bandwidth, depending on DMA and how
the application is written and how reliably the NIC device driver is
integrated with the kernel (interrupts on multiple devices using the
same driver obviously have to be carefully and predictably resolvable).
A NIC without DMA will obviously just block anyway until a transmission
is completed -- two transmissions can never be resolved in parallel.
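
On the application side, "written carefully" means roughly the
following: never sit in a blocking send on one channel while the others
are idle.  Here is a minimal sketch using nonblocking sockets and
Python's selectors module, assuming the sockets are already connected,
one per NIC.  The function name is made up for the illustration, and
whether the transfers really overlap still depends on the NIC, the
driver and the kernel, which this code cannot show.

```python
import selectors
import socket


def send_on_all_channels(channels, payload):
    """Send payload on every socket, interleaving partial writes.

    Each socket is made nonblocking, and data is pushed to whichever
    channel is ready, so a slow channel does not stall the others.
    """
    sel = selectors.DefaultSelector()
    for sock in channels:
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_WRITE, data=memoryview(payload))
    pending = len(channels)
    while pending:
        for key, _ in sel.select():
            sock, remaining = key.fileobj, key.data
            try:
                sent = sock.send(remaining)
            except BlockingIOError:
                continue                      # not ready after all; retry later
            remaining = remaining[sent:]
            if len(remaining):
                sel.modify(sock, selectors.EVENT_WRITE, data=remaining)
            else:
                sel.unregister(sock)          # this channel is finished
                pending -= 1
    sel.close()


if __name__ == "__main__":
    # Three local socket pairs stand in for three NIC-to-NIC links.
    pairs = [socket.socketpair() for _ in range(3)]
    send_on_all_channels([a for a, _ in pairs], b"x" * 1024)
    print([len(b.recv(4096)) for _, b in pairs])
```

The receiving ends have to drain their sockets as well; with a large
payload and nobody reading, the loop above would simply wait.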

Channel bonding, on the other hand, solves a very different problem --
how to get more raw internode bandwidth using any given kind of NIC.
This is also "expensive" in human time and possibly system time, but it
may be the cheapest (or only!) way to get internode bandwidth in certain
exotic ranges.  If you have a parallel application that is not
particularly sensitive to latency but needs huge interprocessor
bandwidth to scale well, it can easily be your ticket to a functional
design (for a largish but still COTS price).  If I recall what Don
Becker once told me, your aggregate bandwidth increases, not quite
linearly, for up to three NICs but the fourth (at the time of the
discussion, not necessarily now) was either not worth it or actually
decreased aggregate bandwidth a bit.

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu