[Beowulf] How to configure a cluster network

Fri Jul 25 11:35:22 PDT 2008

Hi Mark,
Thanks for helping keep the FNN meme alive while I've been "away". :-)

On Fri, Jul 25, 2008 at 1:38 AM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> to generate a Universal FNN.  FNNs don't really shine until you have 3 or
>> 4 NICs/HCAs per compute node.
>
> depends on costs.  for instance, the marginal cost of a second IB port on a
> nic seems to usually be fairly small.  for instance, if you have 36 nodes,
> 3x24pt switches is pretty neat for 1 hop nonblocking.
> two switches in a 1-level fabric would get 2 hops and 3:1 blocking.
> if arranged in a triangle, 3x24 would get 1 hop 2:1, which might be an
> interesting design point.

Yes, of course the choice of FNN design parameters depends on cost, and
since 2-port HCAs are common for IB, that should be considered.  My
comment about FNNs shining in the 3 or 4 NIC/node range is because of
the jump in node count you can support with a given switch size.  With only
2 NICs/node, the triangle pattern is pretty much all you can get, which
allows you to connect 50% more nodes than your switch size
(36 nodes w/24-port switches).  At 4 NICs/node, by contrast, a Universal FNN
with 24-port switches can connect 72 nodes, 3x the switch size.
Of course, the cost/node of the network goes up (relative to the
2-NIC/node FNN), since you have twice as many wires, NICs and
switch-ports (per node).
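
To make the 36-node and 72-node figures concrete, here is a minimal Python
sketch.  It is only an illustration, not the output of any FNN design tool:
the function names are made up, and the 4-NIC wiring uses an affine-plane
layout (9 groups of 8 nodes, 12 switches), which is one arrangement that
reaches 72 nodes and is assumed here rather than taken from the original
design.  The check simply confirms that every pair of nodes shares at
least one switch.

```python
from itertools import combinations


def triangle_fnn(ports=24):
    """2 NICs/node: 3 switches, each pair of switches shares ports//2 nodes."""
    half = ports // 2
    switches = [set(), set(), set()]
    node = 0
    for a, b in combinations(range(3), 2):
        for _ in range(half):
            switches[a].add(node)
            switches[b].add(node)
            node += 1
    return node, switches


def affine_plane_fnn(ports=24, q=3):
    """(q+1) NICs/node, q prime: groups are points of a q x q grid and each
    switch serves the q groups lying on one line of the affine plane."""
    group_size = ports // q
    group_nodes = {}
    node = 0
    for x in range(q):
        for y in range(q):
            group_nodes[(x, y)] = range(node, node + group_size)
            node += group_size
    # Lines y = m*x + c (mod q), plus the q vertical lines x = c.
    lines = [[(x, (m * x + c) % q) for x in range(q)]
             for m in range(q) for c in range(q)]
    lines += [[(c, y) for y in range(q)] for c in range(q)]
    switches = [{n for g in line for n in group_nodes[g]} for line in lines]
    return node, switches


def is_universal(num_nodes, switches, ports, nics):
    """True if port and NIC budgets hold and every node pair shares a switch."""
    if any(len(s) > ports for s in switches):
        return False
    if any(sum(n in s for s in switches) != nics for n in range(num_nodes)):
        return False
    return all(any(a in s and b in s for s in switches)
               for a, b in combinations(range(num_nodes), 2))


if __name__ == "__main__":
    for name, (n, sw), nics in [("triangle", triangle_fnn(), 2),
                                ("4-NIC", affine_plane_fnn(), 4)]:
        print(name, n, "nodes,", len(sw), "switches,",
              "universal:", is_universal(n, sw, 24, nics))
```

Note that the 4-NIC layout uses 12 switches for 72 nodes versus 3 switches
for 36 nodes, which is the doubling of per-node switch-ports mentioned above.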

>> Though, as others have mentioned, IB switch latency is pretty darn small,
>> so latency would not be the primary reason to use FNNs with IB.
>
> yeah, that's a good point - FNN is mainly about utilizing "zero-order"
> switching when the node selects which link to use, and shows the biggest
> advantage when it's slow or hard to do multi-level fabrics.

My perspective on what is the best or most important aspect of a FNN
has shifted over the years.  I honestly think it really depends on the goals
of the cluster in question.  For some, the latency reduction is key.  For
others it is the guaranteed bandwidth between pairs of nodes (since no
communication link is shared between disjoint node pairs, communication
patterns that are permutations pass conflict free).  And for some it is the
potential cost savings to get "good" connectivity for more nodes than a
single switch can handle.  And another potential benefit is that you can
engineer the FNN to place more bandwidth between specified node pairs.
This last benefit turned into my dissertation on Sparse FNNs, which
directly exploit a priori knowledge of expected communication patterns.
It has yet to be shown in a practical installation that a Sparse FNN is
the right choice (politically or otherwise).  I don't know
of any implementations beyond our KASY0 machine from 2003.
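
The guaranteed-bandwidth point above (permutations pass conflict free) can
be checked on a toy case.  The sketch below is an illustration only: it
assumes a 6-node triangle FNN built from 4-port switches, full-duplex links,
and switches that are internally non-blocking, and all names in it are made
up.  It routes each sender through a switch it shares with its receiver and
counts how many messages land on each directed node-to-switch and
switch-to-node link.

```python
from collections import Counter
from itertools import permutations

# Toy triangle FNN: 6 nodes, 2 NICs each, three 4-port switches (assumed).
SWITCHES = [{0, 1, 2, 3}, {0, 1, 4, 5}, {2, 3, 4, 5}]


def shared_switch(a, b, switches):
    """Return the index of the first switch that both nodes attach to."""
    for sw_id, members in enumerate(switches):
        if a in members and b in members:
            return sw_id
    raise ValueError(f"nodes {a} and {b} share no switch")


def link_loads(perm, switches):
    """perm[src] = dst.  Count messages per directed link for one pattern."""
    loads = Counter()
    for src, dst in enumerate(perm):
        if src == dst:
            continue  # a node "sending to itself" uses no network link
        sw = shared_switch(src, dst, switches)
        loads[("up", src, sw)] += 1    # node -> switch direction
        loads[("down", sw, dst)] += 1  # switch -> node direction
    return loads


def is_conflict_free(perm, switches):
    """True if no directed link carries more than one message."""
    return all(count == 1 for count in link_loads(perm, switches).values())


def all_permutations_conflict_free(num_nodes, switches):
    """Exhaustively check every permutation pattern (small num_nodes only)."""
    return all(is_conflict_free(p, switches)
               for p in permutations(range(num_nodes)))


if __name__ == "__main__":
    print(all_permutations_conflict_free(6, SWITCHES))
```

The result follows from every path being a single switch hop over links
dedicated to one node; a fabric with shared inter-switch uplinks would not
pass the same check in general.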

>> I wonder if anyone has built a FNN using IB... or for that matter, any
>> link technology other than Ethernet?
>
> I'm a little unclear on how routing works on IB - does a node have something
> like an ethernet neighbor table that tracks which other nodes are
> accessible through which port?

Ah, well, having never built an IB based FNN, I don't know the very low
level details of what would be required, but from what I understand about
how IB routing works, it would simply be a matter of setting up the
proper routing tables.  AFAIK, the Open MPI IB implementation would
figure it out automatically, as long as the disjoint IB fabrics had unique
IDs (GIDs?), the equivalent of the subnet address & mask for ethernet.
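
Independent of the link technology, the per-node table in question is small.
The sketch below is a generic illustration, not IB or Open MPI configuration:
it assumes each node numbers its NICs in the order of the switches it is
attached to, and the function name and the toy 6-node triangle wiring are
made up.  For each node it builds a destination -> local NIC map, which is
the "node selects which link to use" step.

```python
def build_nic_tables(switches):
    """switches: list of sets of node ids.
    Returns {node: {destination: local NIC index}}."""
    # Assumption: a node's NIC k is cabled to the k-th switch (by switch
    # index) that the node appears on.
    attached = {}
    for sw_id, members in enumerate(switches):
        for node in sorted(members):
            attached.setdefault(node, []).append(sw_id)

    tables = {}
    for src, src_switches in attached.items():
        table = {}
        for nic, sw_id in enumerate(src_switches):
            for dst in switches[sw_id]:
                # Keep the first NIC found when two nodes share >1 switch.
                if dst != src and dst not in table:
                    table[dst] = nic
        tables[src] = table
    return tables


if __name__ == "__main__":
    # Toy triangle FNN: 6 nodes, 2 NICs each, three 4-port switches.
    toy = [{0, 1, 2, 3}, {0, 1, 4, 5}, {2, 3, 4, 5}]
    for node, table in sorted(build_nic_tables(toy).items()):
        print(node, dict(sorted(table.items())))
```

Where two nodes share more than one switch (nodes 0 and 1 here), the sketch
just keeps the first NIC; a real setup could spread traffic across both.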

> I think the real problem is that small IB switches have never really gotten
> cheap, even now, in the same way ethernet has.  or IB cables,
> for that matter.

Yeah, that is true.  Though, how much more do the larger switches cost?
What really counts is the ratio of small switch to large switch cost, assuming
you are trying to save money with a FNN, and that cables are not
ludicrously expensive.  Though, not every HPC installation is as
monetarily limited as taxpayers might hope.  (Oh, did I say that out loud?)

Oh, another topic of discussion is how many-core nodes change
the design space for cluster networks.  For instance, does the
network on Ranger have enough bandwidth on a per-core basis?
As far as I can tell, each node has 16 cores, yet each node only has
one IB link?  That is some serious oversubscription if the cores are
not talking locally.

> regards, mark hahn.

-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmattox at gmail.com || timattox at open-mpi.org
 I'm a bright... http://www.the-brights.net/