[Beowulf] Performance tuning for Linux 2.6.22 kernels with gigabit ethernet, bonding, etc.

David Kewley kewley at gps.caltech.edu
Mon Nov 19 19:56:13 PST 2007


On Tuesday 13 November 2007, Bill Johnstone wrote:
> I've been trying to quantify the performance differences between the
> cluster running on the previous switch vs. the new one.  I've been
> using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode
> and also IOR.  In the previous configuration, the 64-bit nodes had only
> a single connection to the switch, and the MTU was 1500.  Under the new
> configuration, all nodes are now running with an MTU of 9000 and the
> 64-bit nodes with the tg3s are set up with the Linux bonding driver to
> form 802.3ad aggregated links using both ports per aggregate link.
> I've not adjusted any sysctls or driver settings.  The e1000 driver is
> version 7.3.20-k2-NAPI as shipped with the Linux kernel.

Looking at the master node of a Rocks cluster during mass rebuilds 
(which involve HTTP transfers), I can keep the output side of its GigE 
link saturated (123 MB/s much of the time) with MTU 1500.  I've never 
encountered a need to increase the MTU, but then I've also never done 
significant MPI over Ethernet (only Myrinet & IB).

Don't know whether an MPI load would be helped by MTU 9000, but I'd not 
assume it would without actually measuring it.
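
A quick way to settle it, assuming you have iperf (or any similar 
bulk-transfer tool) on a pair of nodes -- the hostname and interface 
name here are made up:

  # on a receiving node:
  iperf -s

  # on a sending node, once at each MTU:
  ifconfig eth0 mtu 1500 ; iperf -c node01 -t 30
  ifconfig eth0 mtu 9000 ; iperf -c node01 -t 30

If single-stream throughput and sender CPU load barely move between the 
two runs, jumbo frames probably aren't buying you much for that traffic 
pattern.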

<snip>

> In trying to understand this, I noticed that ifconfig listed something
> like 2000 - 2500 dropped packets for the bonded interfaces on each
> node.  This was following a pass of IMB-MPI1 and IMB-EXT.  The dropped
> packet counts seem split roughly equally across the two bonded slave
> interfaces.  Am I correct in taking this to mean the incoming load on
> the bonded interface was simply too high for the node to service all
> the packets?  I can also note that I tried both "layer2" and "layer3+4"
> for the "xmit_hash_policy" bonding parameter, without any significant
> difference.  The switch itself uses only a layer2-based hash.

I don't know what causes the 'ifconfig' dropped-packet counter to increment.

I've seen syslogd on a central syslog server, receiving messages over 
UDP, get saturated and drop packets.  More precisely: syslogd's socket 
receive buffer was routinely filling up whenever a deluge of messages 
arrived from the compute nodes.  When there is no room left in an 
application's receive buffer for a new packet, the kernel drops the 
packet, so some messages never made it into syslogd, and therefore never 
made it into the logfile on disk.  I don't know whether this form of 
packet drop increments ifconfig's dropped-packet counter.
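
If you want to see whether receive-buffer overflow specifically is 
happening, the UDP counters are easier to interpret than ifconfig's:

  netstat -su                 # "packet receive errors" under Udp:
  grep Udp: /proc/net/snmp    # the InErrors column counts datagrams
                              # dropped, typically for lack of socket
                              # buffer space

Watching which counters climb during a benchmark run tells you which 
layer is doing the discarding.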

When I looked into this specific problem a bit more, I discovered that 
syslogd uses the default socket buffer sizes, so the only way to change 
that (without making a one-line edit to syslogd's source and rebuilding, or 
using an alternative to ye olde syslogd) was to tune the kernel default 
socket receive buffer size:

net.core.rmem_default = 8388608  (from sysctl.conf)
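
To experiment without a reboot, you can set it on the fly and watch 
whether syslogd's receive queue still backs up (UDP port 514 appears in 
hex as :0202):

  sysctl -w net.core.rmem_default=8388608
  grep :0202 /proc/net/udp    # the tx_queue:rx_queue column shows bytes
                              # currently queued on the socket

Entries in sysctl.conf only take effect at boot or after 'sysctl -p'.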

This does not directly bear on your problem, but it might give you something 
to think about.

> 1. What are general network/TCP tuning parameters, e.g. buffer sizes,
> etc. that I should change or experiment with?  For older kernels, and
> especially with the 2.4 series, changing the socket buffer size was
> recommended.  However, various pieces of documentation such as
> http://www.netapp.com/library/tr/3183.pdf indicate that the newer 2.6
> series kernels "auto-tune" these buffers.  Is there still any benefit
> to manually adjusting them?

Standard ones to play with (from /proc/sys/net/core):

rmem_default
wmem_default
rmem_max
wmem_max

(from /proc/sys/net/ipv4):

tcp_rmem
tcp_wmem

I'm guessing you already knew about all those. :)

UDP sockets take their sizes from the core defaults; TCP uses the tcp_* 
settings in ipv4.  Explicit setsockopt() requests on either kind of 
socket are still capped by rmem_max/wmem_max.
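
As a concrete starting point (these particular numbers are just a 
plausible first guess, not measured recommendations):

  # /etc/sysctl.conf
  net.core.rmem_max = 8388608
  net.core.wmem_max = 8388608
  net.ipv4.tcp_rmem = 4096 87380 8388608
  net.ipv4.tcp_wmem = 4096 65536 8388608

tcp_rmem and tcp_wmem each take three values: min, default, and max 
buffer size in bytes.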

I looked at your netapp URL and couldn't confidently identify where it 
discusses "auto-tune"; perhaps it's talking about nfs (server or 
client?) auto-tuning.  For what it's worth, 2.6 kernels do auto-tune TCP 
buffer sizes (see net.ipv4.tcp_moderate_rcvbuf), but only within the 
min/max bounds set by tcp_rmem and tcp_wmem, and UDP sockets are not 
auto-tuned at all.

That fits my syslog example above: on RHEL4's 2.6.9-* kernels there was 
most definitely a need to tune the UDP buffers by hand.

> 2. For the e1000, using the Linux kernel version of the driver, what
> are the relevant tuning parameters, and what have been your experiences
> in trying various values?  There are knobs for the interrupt throttling
> rate, etc. but I'm not sure where to start.

Gosh, I went through this once, but I don't have those results readily 
available to me now.  I'm assuming you found a guide that goes into great 
detail about these tuning parameters, I think from Intel?
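
For what it's worth, the usual starting points are the driver's module 
parameters and the ring sizes (the numbers below are illustrative, not 
recommendations):

  # /etc/modprobe.conf -- one value per port, e.g. for two e1000 ports:
  options e1000 InterruptThrottleRate=8000,8000

  ethtool -g eth0            # show current and maximum RX/TX ring sizes
  ethtool -G eth0 rx 4096    # grow the RX ring (maximum is card-dependent)

A higher throttle rate gives lower latency at the cost of more CPU time 
spent in interrupts; the Intel guide documents the tradeoffs in detail.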

> 3. For the tg3, again, what are the relevant tuning parameters, and
> what have been your experiences in trying various values?  I've found
> it more difficult to find discussions for the "tunables" for tg3 as
> compared to e1000.
>
> 4. What has been people's recent experience using the Linux kernel
> bonding driver to do 802.3ad link aggregation?  What kind of throughput
> scaling have you folks seen, and what about processor load?

Can't help you on either of these.
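
(One generic sanity check that doesn't require tg3- or bonding-specific 
expertise: verify that 802.3ad negotiation actually succeeded and that 
traffic is spreading across both slaves.

  cat /proc/net/bonding/bond0    # aggregator ID, LACP state, per-slave
                                 # link details
  ethtool -S eth0                # per-NIC counters, including errors

If ifconfig shows one slave carrying nearly all the bytes, the hash 
policy isn't distributing your particular set of peer addresses well.)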

> 5. What suggestions are there regarding trying to reduce the number of
> dropped packets?

Find the knob, either in the kernel or in the application itself, that 
controls your application's socket receive buffer size, and try 
increasing it.
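
If it comes down to application code, the per-socket knob is SO_RCVBUF.  
A minimal sketch in C (error handling omitted; the 8 MB figure is just 
an example):

  #include <stdio.h>
  #include <sys/socket.h>

  int main(void)
  {
      int s = socket(AF_INET, SOCK_DGRAM, 0);
      int want = 8 * 1024 * 1024;   /* ask for an 8 MB receive buffer */
      int got;
      socklen_t len = sizeof(got);

      /* The kernel silently caps the request at net.core.rmem_max,
       * so read the size back to see what was actually granted
       * (Linux reports roughly double the granted value, to account
       * for its own bookkeeping overhead). */
      setsockopt(s, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
      getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
      printf("asked for %d bytes, got %d\n", want, got);
      return 0;
  }

Note that on TCP sockets, setting SO_RCVBUF explicitly also turns off 
the kernel's buffer auto-tuning for that socket.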

David


