[Beowulf] Performance tuning for Linux 2.6.22 kernels with gigabit
ethernet, bonding, etc.
beejstone3 at yahoo.com
Tue Nov 13 09:03:45 PST 2007
I run a small Linux cluster using gigabit ethernet as the interconnect.
There are two families of nodes:
(*) Dual-processor AMD K7-MP 2600+ models with onboard e1000 network
interfaces, single port, with a 66 MHz, 64-bit PCI bus connection
according to dmesg
(*) Dual-processor AMD K8 Model 246 with onboard tg3 (BCM95704A7)
network interfaces, dual ports, with a 100 MHz, 64-bit PCIX bus
connection according to dmesg
The previous cluster admins were using a junk commodity netgear gigabit
switch. I just upgraded to a vastly better switch with support for
jumbo frames, link aggregation, etc. I'm aware of some private tests
with a network load generator showing that this switch meets its rated
specifications.
I've been trying to quantify the performance differences between the
cluster running on the previous switch vs. the new one. I've been
using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode
and also IOR. In the previous configuration, the 64-bit nodes had only
a single connection to the switch, and the MTU was 1500. Under the new
configuration, all nodes are now running with an MTU of 9000 and the
64-bit nodes with the tg3s are set up with the Linux bonding driver to
form 802.3ad aggregated links using both ports per aggregate link.
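For reference, the bonding setup is along these lines (module options as
documented in the kernel's bonding.txt; the modprobe.conf syntax and
interface names are just an example, adjust for your distro):

```shell
# /etc/modprobe.conf fragment (example; interface names illustrative)
# mode=802.3ad  -> IEEE 802.3ad dynamic link aggregation (LACP)
# miimon=100    -> poll link state every 100 ms
# lacp_rate=1   -> request fast LACPDUs from the switch
alias bond0 bonding
options bond0 mode=802.3ad miimon=100 lacp_rate=1 xmit_hash_policy=layer3+4
```

followed by bringing up bond0 and enslaving both tg3 ports with
ifenslave.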
I've not adjusted any sysctls or driver settings. The e1000 driver is
version 7.3.20-k2-NAPI as shipped with the Linux kernel.
The Linux kernel in use here is 2.6.22 and the MPI distribution is
OpenMPI 1.2.4, across the board.
I've noticed some interesting performance results. On benchmarks with
large MPI datasets and a lot of cross-communication, the new switch
beats the old one to the tune of anywhere from 20% up to about 70%.
Not that surprising, since the greatest advantage of the new switch
vs. old would be in the high-capacity switching fabric.
However, for many of the benchmarks, and especially with smaller
dataset sizes, performance was surprisingly close, or even significantly
favored the old switch. On the parallel I/O tests which wrote and
read an NFS volume on the head node (also with link aggregation), the
results for IOR were slightly lower with the new switch vs. the old
one. That surprised me given that now jumbo frames were in use and
that the head node (same motherboard/network configuration as the
64-bit compute nodes) was using link aggregation. With IOzone, as the
stride sizes increased, the new switch performance dominated the old
one, but for the backward read test as well as tests with smaller
stride sizes, performance was often a toss-up. For small-to-moderate
datasets, there were several cases in the IMB results where the old
switch was better than the new one.
In trying to understand this, I noticed that ifconfig listed something
like 2000 - 2500 dropped packets for the bonded interfaces on each
node. This was following a pass of IMB-MPI1 and IMB-EXT. The dropped
packet counts seem split roughly equally across the two bonded slave
interfaces. Am I correct in taking this to mean the incoming load on
the bonded interface was simply too high for the node to service all
the packets? I can also note that I tried both "layer2" and "layer3+4"
for the "xmit_hash_policy" bonding parameter, without any significant
difference. The switch itself uses only a layer2-based hash.
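To illustrate why the two policies might behave the same here, below is a
sketch of the slave-selection formulas from Documentation/networking/
bonding.txt, using made-up MAC/IP/port values:

```shell
# Slave-selection hashes used by the Linux bonding driver
# (formulas from Documentation/networking/bonding.txt).
# All addresses and ports below are made-up example values.
SLAVES=2

# xmit_hash_policy=layer2:
#   (source MAC XOR destination MAC) modulo slave count
SRC_MAC_LOW=0x1a
DST_MAC_LOW=0x2b
echo "layer2 slave:   $(( (SRC_MAC_LOW ^ DST_MAC_LOW) % SLAVES ))"

# xmit_hash_policy=layer3+4:
#   ((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff)) mod slaves
SRC_PORT=50000
DST_PORT=5001
SRC_IP=$(( (10 << 8) | 11 ))   # low 16 bits of an example source IP
DST_IP=$(( (10 << 8) | 12 ))   # low 16 bits of an example destination IP
echo "layer3+4 slave: $(( ((SRC_PORT ^ DST_PORT) ^ ((SRC_IP ^ DST_IP) & 0xffff)) % SLAVES ))"
```

The point being: with layer2, a given pair of hosts hashes every flow to
the same slave (the MACs never change), and layer3+4 only varies with the
port numbers, so MPI traffic concentrated on a few port pairs may not
spread well under either policy.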
I'm sure that some of the reason why the new switch did not beat the
previous one as decisively and across-the-board is due to the differing
switch hardware strategies regarding packet forwarding, buffering, etc.
But I'm more concerned with how much the lackluster performance spots
are due to my lack of tuning the Linux networking environment and
drivers in any way. To that end, I'd really appreciate your input on
the following questions:
1. What general network/TCP tuning parameters (e.g., buffer sizes)
should I change or experiment with? For older kernels, and
especially with the 2.4 series, changing the socket buffer size was
recommended. However, various pieces of documentation such as
http://www.netapp.com/library/tr/3183.pdf indicate that the newer 2.6
series kernels "auto-tune" these buffers. Is there still any benefit
to manually adjusting them?
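For concreteness, the knobs I mean are things like the following (values
purely illustrative, not a recommendation); my understanding is that 2.6
autotunes within these ceilings, so raising the maxima may still matter:

```shell
# Raise the ceilings the 2.6 autotuner is allowed to grow into.
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
# min / default / max (bytes) for autotuned TCP socket buffers
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
```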
2. For the e1000, using the Linux kernel version of the driver, what
are the relevant tuning parameters, and what have been your experiences
in trying various values? There are knobs for the interrupt throttling
rate, etc. but I'm not sure where to start.
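The knobs I've found so far are module parameters like these (names from
the e1000 driver documentation; the values are just placeholders I would
experiment with):

```shell
# /etc/modprobe.conf fragment -- e1000 module parameters (illustrative values)
# InterruptThrottleRate: max interrupts/sec per port (0 = off, 1 or 3 = dynamic)
# TxDescriptors/RxDescriptors: ring sizes (maxima are hardware-dependent)
options e1000 InterruptThrottleRate=8000 TxDescriptors=1024 RxDescriptors=1024
```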
3. For the tg3, again, what are the relevant tuning parameters, and
what have been your experiences in trying various values? I've found
it more difficult to find discussions for the "tunables" for tg3 as
compared to e1000.
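As far as I can tell, tg3 has essentially no module parameters, and the
tunables are the generic ethtool ones (assuming ethtool support in this
driver/kernel; values illustrative):

```shell
# Interrupt coalescing: delay interrupts up to N usec or M frames
ethtool -C eth0 rx-usecs 100 rx-frames 50
# Ring buffer sizes (query the hardware maxima first with: ethtool -g eth0)
ethtool -G eth0 rx 511
# Driver-level statistics, useful for distinguishing kinds of drops
ethtool -S eth0
```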
4. What has been people's recent experience using the Linux kernel
bonding driver to do 802.3ad link aggregation? What kind of throughput
scaling have you folks seen, and what about processor load?
5. What suggestions are there for reducing the number of dropped packets?
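In case it helps frame that last question, here is how I've been counting
the drops so far (standard tools only, nothing driver-specific assumed):

```shell
# Kernel-level per-interface drop counters
ifconfig bond0 | grep -i drop
cat /proc/net/dev
# Driver statistics, which often separate ring overruns from other drops
ethtool -S eth0 | grep -i -e drop -e error
```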
Thanks for your advice and input.