[Beowulf] split traffic to two interfaces for two "subnets"

Robert G. Brown rgb at phy.duke.edu
Thu May 11 05:16:29 PDT 2006


On Wed, 10 May 2006, Yaroslav Halchenko wrote:

> if I understand correctly, you are suggesting to create two /24 networks
> (not a single /23); since they all will be routed by the same switch anyway, I
> think that routing should not be an issue, nor bottleneck...

The problem is that there are both layer two and layer three issues to
consider.  The switch doesn't "route" except at layer two (ethernet).
It maintains a table of which MAC addresses it hears at the end of each
wire and sends frames with filled-in destination ethernet addresses down
the right wire, regardless of the layer three packet headers (this is
for a standard cheap non-VLAN non-managed switch -- if your switch is
really what I would have called a router back in the day and can manage
IP-level switching, then the entire discussion is moot -- put your two
networks on different class C's, set up two class C VLANs on your
switch, set interface routes on your server, and you are REALLY done).
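The learn-and-forward behavior of a dumb switch can be sketched as a toy
model (pure illustration, not real switch firmware -- names and port
numbers are made up):

```python
# Toy model of layer-2 learning: a switch remembers which port each
# source MAC was heard on, forwards known unicast frames out only that
# port, and floods everything else (unknown destinations, broadcasts).
class ToySwitch:
    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.mac_table = {}                       # MAC address -> port

    def receive(self, src_mac, dst_mac, in_port):
        self.mac_table[src_mac] = in_port         # learn where src lives
        if dst_mac in self.mac_table:             # known unicast: one port
            return [self.mac_table[dst_mac]]
        # unknown destination (or broadcast): flood every other port
        return [p for p in range(self.num_ports) if p != in_port]

sw = ToySwitch(4)
print(sw.receive("aa", "bb", 0))   # "bb" unknown: flooded to [1, 2, 3]
print(sw.receive("bb", "aa", 2))   # "aa" was learned on port 0: [0]
```

Note that none of this looks at IP headers at all, which is the whole
point: both /24s ride the same layer-2 fabric transparently.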

Packets addressed to an IP for which the sending machine has no
destination MAC address in its ARP table have to be routed, either to
the default gateway or otherwise (see the output of the route command on
any client).  The sender will often generate an ARP broadcast first,
asking for the right address.  IP broadcasts are wrapped in ethernet
broadcast frames (directly, or after being sent to the gw and THEN
wrapped) -- either way, they go out in ethernet broadcasts so that all
hosts on the wire can see them and check whether the IP addresses match up.

With a default route that is "the wire" (the interface itself, not a
host on the interface), packets for both IP networks will go onto that
wire and hence be switched to the right host, and this will even happen
efficiently as soon as the switch and both ends of the connection have
populated their MAC and ARP tables.

A default route of "the wire" will not work on the server, however.
There you'll have to manually configure one route to 10.0.0.0/24 on
eth0 and a second route to 10.0.1.0/24 on eth1.  This isn't because
traffic sent out on the wrong interface won't get where it needs to go;
it is because it WILL get where it needs to go, but not the way you
want, since you are trying to split up the traffic.  It won't do you
much good if client requests for a file come in on eth0 and eth1 from
clients on the two networks just the way you want, but all the DATA gets
sent BACK out eth0 because that is the only/default IP route.
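The routing decision the server has to make can be sketched with the
stdlib ipaddress module (interface names are just the ones from this
thread; real kernels do longest-prefix match, here a simple first-match
over two equal-length routes suffices):

```python
import ipaddress

# The two interface routes you'd configure on the dual-homed server:
# traffic for each /24 should leave via its own NIC.
routes = [
    (ipaddress.ip_network("10.0.0.0/24"), "eth0"),
    (ipaddress.ip_network("10.0.1.0/24"), "eth1"),
]

def egress_interface(dst):
    """Return the interface whose route covers dst, or None."""
    addr = ipaddress.ip_address(dst)
    for net, iface in routes:
        if addr in net:
            return iface
    return None   # no match: would fall through to a default gw, if any

print(egress_interface("10.0.0.17"))   # eth0
print(egress_interface("10.0.1.17"))   # eth1
```

Without the second route, every reply -- whichever interface the request
arrived on -- would match only the single default and exit eth0, which
is exactly the failure mode described above.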

> and what do you mean by segregating broadcasts? shouldn't it be simple
> DNAT from lets say 10.0.0.255 to 10.0.1.255 so it gets heard?

Again, this is a layer two/layer three issue, and it isn't MUCH of an
issue.  In many cases you might want broadcasts NOT to be forwarded
between the two subnets, especially if clients don't ever need to talk
to one another.  Broadcasts (other than ARP) between clients are usually
unnecessary; broadcasts (say of RARP/PXE-associated traffic at boot) to
the yet-unknown server need to be able to REACH that server, but they
tend to be rare and task specific.  Second, there are both layer 2 and
layer 3 broadcasts.  On a shared switch all ethernet broadcasts will
reach all hosts on both IP subnets.  IP broadcasts will also reach all
hosts, since they go out in ethernet broadcast frames, but of course
they'll only be accepted by hosts with matching IP subnets.
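To make the layer-3 side concrete: each /24 has its own directed
broadcast address, and a host only accepts the one for the subnet it is
configured in, even though the ethernet broadcast frame carrying either
reaches every NIC on the shared switch.  A quick stdlib check:

```python
import ipaddress

net_a = ipaddress.ip_network("10.0.0.0/24")
net_b = ipaddress.ip_network("10.0.1.0/24")

print(net_a.broadcast_address)   # 10.0.0.255
print(net_b.broadcast_address)   # 10.0.1.255

# A host on 10.0.1.0/24 does not treat 10.0.0.255 as its broadcast:
print(ipaddress.ip_address("10.0.0.255") in net_b)   # False
```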

This is the thing that using two switches would prevent.  Ethernet
broadcasts would not be forwarded, nor would IP broadcasts.  As you say,
if you turn on NAT and packet forwarding between the two interfaces and
subnets, clients on one side CAN talk to clients on the other, but it is
much more "expensive" to do so:  your latency goes way up, your bw goes
way down, and during high traffic communications you load your server's
CPU and network interface(s).  Therefore it isn't recommended if you're
using ethernet-based IPCs between clients on the two subnets.  In that
case you should leave the ethernet flat and permit ARP to propagate
across both IP networks.  I think.

> unfortunately that would not fly since the older switch (I am about to
> substitute with a new 44 ports DLink) doesn't support jumbo frames which
> from my experiments on a new switch help a lot in terms of throughput
> and CPU load.

Right, so from this it is apparent that you ARE using ethernet for IPCs
so splitting your networks on different switches really isn't an option
even without jumbo frames (unless they are master/slave computations and
the clients don't need to talk to each other, only to the
server/master).

Besides, there aren't a LOT of advantages to splitting the networks
anyway.  You don't have enough hosts on either subnet to make broadcast
traffic much of an issue unless you are doing something strange or your
configuration(s) are a bit broken.  On a very crowded, large, flat
network you can get to where there is enough broadcast traffic that it
starts to impact performance a bit -- remember, TCP/IP has to be
processed by the CPU so even IGNORING a packet as "not for me" requires
SOME effort and diverts the CPU away from other tasks while it is going
on.  With only about a dozen hosts per subnet and 24-25 total I'd guess
that this is totally ignorable at worst and can be squeezed down to near
nothing by appropriately configuring the hosts (not running services
that generate a lot of broadcasts:-).  In any event, tcpdump or the like
will let you listen on any interface and measure the total passive
traffic.  On a subnet with maybe ten hosts on it I see a steady stream
of low-grade broadcast traffic, mostly from hosts doing network
maintenance tasks -- e.g. arp packets to the gw, ipp packets.  Probably
averages a few packets per second.

Multiply by ten and it is still a trivial fraction of network/processing
capacity.  Multiply by twenty or thirty, add a few "problem children"
that are misconfigured or doing work that generates a lot of associated
broadcast traffic, and it CAN become a problem, given the minimum
latency issues per packet.  This is, after all, one of several reasons
switches more or less wiped out repeater hubs especially in cluster
applications -- imagine the impact on performance if you have ~100 hosts
on a SHARED hub so that each host "sees" the traffic for all hosts. Even
though it doesn't accept the payload for other hosts, it has to at least
examine the destination MAC address in the frame header of each packet
to decide to ignore it, one packet at a time.  On big, flat
10Base networks ~15-20 years ago I remember the network being "brought
to its knees" by certain ill-behaved network protocols (e.g. decnet)
that shared the wire and thought it perfectly reasonable to broadcast
every second or so...
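Back-of-the-envelope numbers bear this out (the per-host rate and frame
size below are assumptions, just to show the scale):

```python
# Rough estimate of broadcast overhead on a flat gigabit segment.
# Per-host rate and frame size are illustrative guesses, per the
# "few packets per second" observation above.
hosts = 30
pkts_per_sec_per_host = 3
frame_bytes = 100                    # small ARP/service broadcasts

total_pps = hosts * pkts_per_sec_per_host
total_bps = total_pps * frame_bytes * 8
link_bps = 1_000_000_000             # gigabit ethernet

fraction = total_bps / link_bps
print(f"{total_pps} pkt/s = {fraction:.6%} of link capacity")
# 90 pkt/s, well under a hundredth of a percent of the wire:
# the cost that matters is per-packet CPU/interrupt work, not bandwidth.
```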

So anyway, yeah, what you propose should work, I think, as long as you
manage the layer 3 routing on the server end especially, and will likely
be easier to configure if you use two class C subnets instead of trying
to craft netmasks for chunks of 10.0.0.0/24 (something I did from time
to time in the old days, and which can easily give you a headache
getting just the right netmask/broadcast to permit appropriate routing
and discovery).  With only 16 million addresses to play with, there isn't
any point in working hard to pack your 26 or so working addresses into
the bottom 256 addresses in your total address space.  Hell, give 'em
each their own class B -- this might even conceivably be marginally more
efficient if the IP stack processes addresses from the high (most
significant) to low (least significant) bytewise and quits when it fails
to match...;-)
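For what it's worth, the netmask bookkeeping that used to cause those
headaches is a one-liner today -- here splitting a /23 into the two /24s
under discussion (purely illustrative):

```python
import ipaddress

# Carving a /23 into two /24s: each gets its own netmask and broadcast
# address, which was the fiddly part when done by hand.
for sub in ipaddress.ip_network("10.0.0.0/23").subnets(new_prefix=24):
    print(sub, sub.netmask, sub.broadcast_address)
# 10.0.0.0/24 255.255.255.0 10.0.0.255
# 10.0.1.0/24 255.255.255.0 10.0.1.255
```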

Not that ANYONE could care at this level of operation at a few ns
difference max in processing time per packet...:-)

Remember when debugging any of this that tcpdump is your friend, and so
are things like netstat, ifconfig, and route.  Mistakes made in routing
can be very bad, although you sound like you are easily experienced
enough to avoid the worst of them.

I do remember when some bright lad at UNC set up a Shiva FastPath back
in the '80s (when whole states were basically flat as far as ethernet
was concerned) with its user-configurable ethernet address set to a mask
that accepted all incoming packets, period.  The FastPath rapidly became
the black hole of layer 2 routing as interfaces learned that they could
ALWAYS send a packet to Mikey, he'd eat ANYTHING.  Then there were the
SGIs that came with e.g. bootp configured and running AND which were
generally the fastest things on the net, enabling a loverly race
condition.  A diskless client would come up and send out RARP asking
"who the hell am I and what should I boot?" and before its REAL server
(on the flat network, of course -- switches were heinously expensive)
could answer the SGI would pipe up with "I have no idea!" and basically
trash the boot.  Ah, the good old days -- I don't miss them one damn
bit...:-)

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
