[Beowulf] Performance tuning for Linux 2.6.22 kernels with gigabit ethernet, bonding, etc.

David Kewley kewley at gps.caltech.edu
Mon Nov 19 19:56:13 PST 2007


On Tuesday 13 November 2007, Bill Johnstone wrote:
> I've been trying to quantify the performance differences between the
> cluster running on the previous switch vs. the new one.  I've been
> using the Intel MPI Benchmarks (IMB) as well as IOzone in network mode
> and also IOR.  In the previous configuration, the 64-bit nodes had only
> a single connection to the switch, and the MTU was 1500.  Under the new
> configuration, all nodes are now running with an MTU of 9000 and the
> 64-bit nodes with the tg3s are set up with the Linux bonding driver to
> form 802.3ad aggregated links using both ports per aggregate link.
> I've not adjusted any sysctls or driver settings.  The e1000 driver is
> version 7.3.20-k2-NAPI as shipped with the Linux kernel.

Looking at the master node of a Rocks cluster during mass rebuilds 
(which involve HTTP transfers), I can keep the output side of its GigE 
link saturated (123 MB/s much of the time) with MTU 1500.  I've never 
encountered a need to increase the MTU, but then I've also never done 
significant MPI over Ethernet (only Myrinet & IB).

Don't know whether an MPI load would be helped by MTU 9000, but I'd not 
assume it would without actually measuring it.
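
A quick way to settle it, assuming you have iperf (or any similar 
bulk-transfer tool) on a pair of nodes -- the hostname and interface 
name here are made up:

  # on a receiving node:
  iperf -s

  # on a sending node, once at each MTU:
  ifconfig eth0 mtu 1500 ; iperf -c node01 -t 30
  ifconfig eth0 mtu 9000 ; iperf -c node01 -t 30

If single-stream throughput and sender CPU load barely move between the 
two runs, jumbo frames probably aren't buying you much for that traffic 
pattern.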

<snip>

> In trying to understand this, I noticed that ifconfig listed something
> like 2000 - 2500 dropped packets for the bonded interfaces on each
> node.  This was following a pass of IMB-MPI1 and IMB-EXT.  The dropped
> packet counts seem split roughly equally across the two bonded slave
> interfaces.  Am I correct in taking this to mean the incoming load on
> the bonded interface was simply too high for the node to service all
> the packets?  I can also note that I tried both "layer2" and "layer3+4"
> for the "xmit_hash_policy" bonding parameter, without any significant
> difference.  The switch itself uses only a layer2-based hash.

I don't know what causes the 'ifconfig' dropped-packet counter to increment.

I've seen syslogd on a central syslog server, receiving messages over 
UDP, get saturated and drop packets.  More precisely: syslogd's socket 
receive buffer was routinely filling up whenever a deluge of messages 
arrived from the compute nodes.  When there is no room left in an 
application's receive buffer for a new packet, the kernel drops the 
packet, so some messages never made it into syslogd, and therefore never 
made it into the logfile on disk.  I don't know whether this form of 
packet drop increments ifconfig's dropped-packet counter.
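
If you want to see whether receive-buffer overflow specifically is 
happening, the UDP counters are easier to interpret than ifconfig's:

  netstat -su                 # "packet receive errors" under Udp:
  grep Udp: /proc/net/snmp    # the InErrors column counts datagrams
                              # dropped, typically for lack of socket
                              # buffer space

Watching which counters climb during a benchmark run tells you which 
layer is doing the discarding.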

When I looked into this specific problem a bit more, I discovered that 
syslogd uses the default socket buffer sizes, so the only way to change 
that (without making a one-line edit to syslogd's source and rebuilding, or 
using an alternative to ye olde syslogd) was to tune the kernel default 
socket receive buffer size:

net.core.rmem_default = 8388608  (from sysctl.conf)
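
To experiment without a reboot, you can set it on the fly and watch 
whether syslogd's receive queue still backs up (UDP port 514 appears in 
hex as :0202):

  sysctl -w net.core.rmem_default=8388608
  grep :0202 /proc/net/udp    # the tx_queue:rx_queue column shows bytes
                              # currently queued on the socket

Entries in sysctl.conf only take effect at boot or after 'sysctl -p'.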

This does not directly bear on your problem, but it might give you something 
to think about.

> 1. What are general network/TCP tuning parameters, e.g. buffer sizes,
> etc. that I should change or experiment with?  For older kernels, and
> especially with the 2.4 series, changing the socket buffer size was
> recommended.  However, various pieces of documentation such as
> http://www.netapp.com/library/tr/3183.pdf indicate that the newer 2.6
> series kernels "auto-tune" these buffers.  Is there still any benefit
> to manually adjusting them?

Standard ones to play with (from /proc/sys/net/core):

rmem_default
wmem_default
rmem_max
wmem_max

(from /proc/sys/net/ipv4):

tcp_rmem
tcp_wmem

I'm guessing you already knew about all those. :)

UDP sockets take their sizes from the core defaults; TCP uses the tcp_* 
settings in ipv4.  Explicit setsockopt() requests on either kind of 
socket are still capped by rmem_max/wmem_max.
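
As a concrete starting point (these particular numbers are just a 
plausible first guess, not measured recommendations):

  # /etc/sysctl.conf
  net.core.rmem_max = 8388608
  net.core.wmem_max = 8388608
  net.ipv4.tcp_rmem = 4096 87380 8388608
  net.ipv4.tcp_wmem = 4096 65536 8388608

tcp_rmem and tcp_wmem each take three values: min, default, and max 
buffer size in bytes.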

I looked at your netapp URL and couldn't confidently identify where it 
discusses "auto-tune"; perhaps it's talking about nfs (server or 
client?) auto-tuning.  For what it's worth, 2.6 kernels do auto-tune TCP 
buffer sizes (see net.ipv4.tcp_moderate_rcvbuf), but only within the 
min/max bounds set by tcp_rmem and tcp_wmem, and UDP sockets are not 
auto-tuned at all.

That fits my syslog example above: on RHEL4's 2.6.9-* kernels there was 
most definitely a need to tune the UDP buffers by hand.

> 2. For the e1000, using the Linux kernel version of the driver, what
> are the relevant tuning parameters, and what have been your experiences
> in trying various values?  There are knobs for the interrupt throttling
> rate, etc. but I'm not sure where to start.

Gosh, I went through this once, but I don't have those results readily 
available to me now.  I'm assuming you found a guide that goes into great 
detail about these tuning parameters, I think from Intel?
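
For what it's worth, the usual starting points are the driver's module 
parameters and the ring sizes (the numbers below are illustrative, not 
recommendations):

  # /etc/modprobe.conf -- one value per port, e.g. for two e1000 ports:
  options e1000 InterruptThrottleRate=8000,8000

  ethtool -g eth0            # show current and maximum RX/TX ring sizes
  ethtool -G eth0 rx 4096    # grow the RX ring (maximum is card-dependent)

A higher throttle rate gives lower latency at the cost of more CPU time 
spent in interrupts; the Intel guide documents the tradeoffs in detail.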

> 3. For the tg3, again, what are the relevant tuning parameters, and
> what have been your experiences in trying various values?  I've found
> it more difficult to find discussions for the "tunables" for tg3 as
> compared to e1000.
>
> 4. What has been people's recent experience using the Linux kernel
> bonding driver to do 802.3ad link aggregation?  What kind of throughput
> scaling have you folks seen, and what about processor load?

Can't help you on either of these.
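
(One generic sanity check that doesn't require tg3- or bonding-specific 
expertise: verify that 802.3ad negotiation actually succeeded and that 
traffic is spreading across both slaves.

  cat /proc/net/bonding/bond0    # aggregator ID, LACP state, per-slave
                                 # link details
  ethtool -S eth0                # per-NIC counters, including errors

If ifconfig shows one slave carrying nearly all the bytes, the hash 
policy isn't distributing your particular set of peer addresses well.)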

> 5. What suggestions are there regarding trying to reduce the number of
> dropped packets?

Find the knob, either in the kernel or in the application itself, that 
controls your application's socket receive buffer size, and try 
increasing it.
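
If it comes down to application code, the per-socket knob is SO_RCVBUF.  
A minimal sketch in C (error handling omitted; the 8 MB figure is just 
an example):

  #include <stdio.h>
  #include <sys/socket.h>

  int main(void)
  {
      int s = socket(AF_INET, SOCK_DGRAM, 0);
      int want = 8 * 1024 * 1024;   /* ask for an 8 MB receive buffer */
      int got;
      socklen_t len = sizeof(got);

      /* The kernel silently caps the request at net.core.rmem_max,
       * so read the size back to see what was actually granted
       * (Linux reports roughly double the granted value, to account
       * for its own bookkeeping overhead). */
      setsockopt(s, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
      getsockopt(s, SOL_SOCKET, SO_RCVBUF, &got, &len);
      printf("asked for %d bytes, got %d\n", want, got);
      return 0;
  }

Note that on TCP sockets, setting SO_RCVBUF explicitly also turns off 
the kernel's buffer auto-tuning for that socket.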

David


