Channel-bonding - choice of switch?

Tue Jul 3 12:02:45 PDT 2001

Hello,

I'm currently trying to channel-bond our 5-node test cluster (running
Scyld).  Since we already had an HP 4000m ProCurve switch
(http://www.hp.com/rnd/products/switches/switch4000/specs.htm),
I decided to use that.  It does support trunking and Fast
EtherChannel.

So far, I've found several things wrong with this choice of switch, and
I'd like to know if there are work-arounds, and what switches you
folks prefer to use for channel-bonding with no configuration
necessary.

Problems found so far:

1) I must manually telnet to the switch, and specify which ports should
   be "trunked" together.  There is a maximum of 10 trunking groups
   offered.  This will be a problem when we decide to assemble our 32-node 
   cluster... in this case, I will need 32 groups.  The switch actually
   has 64 ports if you fill all slots.

2) When the slaves boot up, they boot off 1 NIC.  My script channel-bonds
   them after the system is up.  Unfortunately, I have to telnet to the
   switch and "UNTRUNK" the ports before a slave boots, then "RETRUNK"
   them once the slave is up.  Otherwise, the slave cannot use the
   network to retrieve the image from the master.  This is a nuisance.

3) There are 3 types of trunks that I can define for each group:

   i) Trunk (SA/DA). This switch will "load-balance" the traffic based
      on the SA(source address)<->DA(dest address) of each packet.
      This means that if node 1 wants to talk to the master, the
      switch will direct it to the SAME port each time.  I would never
      be able to get 200Mbps between any one pair of hosts.  This is
      good if the master is communicating with a bunch of slaves
      simultaneously, but I'd prefer to maximize the throughput anyway.
   ii) SA-Trunk (SA only).  Same principle, but balanced on the source
      address alone.
   iii) FEC (Fast EtherChannel). Same principle; auto-negotiates the above 2.
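
To illustrate why address-based balancing pins one pair of hosts to one
link, here is a rough Python sketch.  The hash (XOR of the two MAC
addresses, modulo the number of links) and the MAC addresses are my own
assumptions for illustration; I don't know what the HP switch actually
computes.  A per-packet round-robin is shown only for contrast.

```python
# Hypothetical sketch: the hash below is an assumption, not the
# documented behaviour of the HP 4000m.

def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer."""
    return int(mac.replace(":", ""), 16)


def pick_link_sa_da(src_mac, dst_mac, n_links):
    """Choose a trunk member from the source/destination address pair."""
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % n_links


def pick_link_round_robin(packet_index, n_links):
    """Choose a trunk member per packet, ignoring the addresses."""
    return packet_index % n_links


# Made-up addresses for one slave and the master.
NODE1 = "00:50:da:00:00:01"
MASTER = "00:50:da:00:00:10"

# Links used by 1000 packets of a single node1 -> master transfer.
sa_da_links = {pick_link_sa_da(NODE1, MASTER, 2) for _ in range(1000)}
rr_links = {pick_link_round_robin(i, 2) for i in range(1000)}

print("SA/DA hashing uses links:", sorted(sa_da_links))
print("round-robin uses links:  ", sorted(rr_links))
```

With any hash that depends only on the addresses, every packet of a
single transfer lands on the same trunk member, so that transfer cannot
go faster than one link.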

Our setup consists of 1 master and 4 slaves.  The master has 3 NICs:
eth0, eth1 and eth2.  eth0 is connected to the outside world.  eth1
and eth2 are bonded.  Each slave has 2 bonded NICs: eth0 and eth1.
All NICs are 3com 905b-tx-nm.

I wrote a script called "measure" which calculates the differences in
RX and TX packets on the master and slaves since the last time it was
run.  This was useful in telling me whether the switch was
load-balancing and whether the OS was using channel bonding properly.
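
For anyone who wants to do something similar, here is a minimal Python
sketch of the counter-diff idea.  It is not the actual "measure"
script.  It assumes the usual Linux /proc/net/dev layout (two header
lines, then 8 receive and 8 transmit fields per interface), and the
two snapshots below are made-up sample data.  On a real node you would
read the text from /proc/net/dev instead.

```python
# Sketch of a "measure"-style counter diff; not the original script.

def parse_proc_net_dev(text):
    """Return {iface: (rx_packets, tx_packets)} from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # fields[1] is receive packets, fields[9] is transmit packets
        counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters


def packet_deltas(before, after):
    """Subtract an earlier snapshot from a later one, per interface."""
    return {
        iface: (after[iface][0] - before[iface][0],
                after[iface][1] - before[iface][1])
        for iface in after
        if iface in before
    }


HEADER = (
    "Inter-|   Receive                    |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast"
    "|bytes packets errs drop fifo colls carrier compressed\n"
)
# Made-up sample snapshots, standing in for two reads of /proc/net/dev.
BEFORE = HEADER + (
    "  eth0: 1000 10 0 0 0 0 0 0 2000 20 0 0 0 0 0 0\n"
    "  eth1: 3000 30 0 0 0 0 0 0 4000 40 0 0 0 0 0 0\n"
)
AFTER = HEADER + (
    "  eth0: 9000 110 0 0 0 0 0 0 9500 170 0 0 0 0 0 0\n"
    "  eth1: 9900 150 0 0 0 0 0 0 9800 190 0 0 0 0 0 0\n"
)

deltas = packet_deltas(parse_proc_net_dev(BEFORE), parse_proc_net_dev(AFTER))
for iface, (rx, tx) in sorted(deltas.items()):
    print(f"{iface} rx {rx} tx {tx}")
```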

If I don't configure the switch, and FTP a file from NODE 2 to MASTER, 
I get a lousy 1700K/s (~14Mbps):
37748736 bytes received in 21 seconds (1.7e+03 Kbytes/s)

output of "measure":
Node 2 eth0 rx 10863 tx 15061
Node 2 eth1 rx 12129 tx 15055
Node MASTER eth0 rx 1170 tx 980
Node MASTER eth1 rx 12802 tx 11515
Node MASTER eth2 rx 16540 tx 11515

As you can see, the TX packets match fairly well, indicating that the
OS is properly alternating the interfaces for transmits.  The switch,
on the other hand, is responsible for distributing the RX, and its
split isn't exactly 50/50.

I configured the switch to TRUNK the ports, and re-ran the experiment.
I got this:

37748736 bytes sent in 3.3 seconds (1.1e+04 Kbytes/s)
Node 2 eth0 rx 4 tx 7984
Node 2 eth1 rx 26198 tx 7983
Node MASTER eth0 rx 56 tx 45
Node MASTER eth1 rx 142 tx 13251
Node MASTER eth2 rx 16042 tx 13252

This is how I learned that the switch was distributing packets based
on the SA/DA.  So, all packets from Node 2 to Master go from
node2:eth1 to master:eth2.
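
To put numbers on the two runs, this small Python sketch computes how
the master's received packets were split across its bonded pair.  The
packet counts are the "measure" figures quoted above; the helper name
split_percent is just something made up for this example.

```python
def split_percent(first, second):
    """Return each counter's share of the combined total, in percent."""
    total = first + second
    return 100.0 * first / total, 100.0 * second / total


# RX packet counts on the master's bonded pair (eth1, eth2).
unconfigured = split_percent(12802, 16540)  # switch left unconfigured
trunked = split_percent(142, 16042)         # switch ports trunked

print("switch unconfigured: eth1 %.1f%%  eth2 %.1f%%" % unconfigured)
print("switch trunked:      eth1 %.1f%%  eth2 %.1f%%" % trunked)
```

The unconfigured run comes out at roughly 44/56, while the trunked run
puts about 99% of the received packets on eth2.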

-----------------------------------------------------------------
Here's a summary of my transfer rates:
UCB=not channel-bonded, CB=channel-bonded, SU=switch unconfigured

Master (UCB) to Node 2 (UCB): 7300 Kbytes/s
Master (UCB) to Node 2 (CB, SU): 3200 Kbytes/s
Master (CB, SU) to Node 2 (CB, SU): 1700 Kbytes/s
Master (CB) to Node 2 (CB): 11000 Kbytes/s
-----------------------------------------------------------------

Channel-bonding does appear to improve performance once the switch is
configured, but there's still a 100Mbps bottleneck.
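
As a sanity check on the units, this Python sketch converts the two FTP
transfers quoted above into megabits per second.  It takes 1 Mbps as
10^6 bits per second, and the times are the rounded values FTP
reported, so the results are approximate.

```python
def mbps(n_bytes, seconds):
    """Convert a transfer of n_bytes in the given time to megabits/s."""
    return n_bytes * 8 / seconds / 1e6


FILE_BYTES = 37748736  # size of the test file in both FTP runs

unconfigured = mbps(FILE_BYTES, 21)   # bonded, switch unconfigured
trunked = mbps(FILE_BYTES, 3.3)       # bonded, switch ports trunked

print("switch unconfigured: %.1f Mbps" % unconfigured)
print("switch trunked:      %.1f Mbps" % trunked)
```

The trunked run works out to a little over 90 Mbps, which is about what
one Fast Ethernet link can carry and well short of the 200Mbps that two
bonded links could offer.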

Can any of you recommend a reliable switch that doesn't load balance
the same way that the HP switch does?  I'd like a switch that 
automatically configures trunking and doesn't impose a 10-group
limitation.

Thanks.  I appreciate any suggestions.

-Mike

-- 
Michael J. Weller, M.Sc.               office: (972) 235-7881 x.242
weller at zyvex.com                         cell: (214) 616-6340
Zyvex Corp., 1321 N Plano           facsimile: (972) 235-7882    
Richardson, TX 75081                      icq: 6180540