[Beowulf] Small packets causing context switch thrashing?

Robert G. Brown rgb at phy.duke.edu
Thu Dec 2 07:08:03 PST 2004

On Wed, 1 Dec 2004, Tracy R Reed wrote:

> Ok, this is not exactly beowulf or supercomputer related but it is
> definitely a form of high performance computing and I am hoping
> the beowulf community has applicable experience.
> I am building a box to convert VOIP traffic from H323 to SIP. The system
> is an AMD64. Both of these protocols use RTP to transmit the voice data
> which means many many small packets. We are currently looking at 8000
> packets per second due to 96 simultaneous voice channels and the box is
> already at 50% cpu. I really think this box should be able to handle a lot
> more than this. I have seen people talk about proxying 2000 RTP streams on
> a P4. We get around 15,000 context switches and 8000 interrupts per second
> and the box is heavily loaded and the load average starts going up. Is
> 8000 packets per second a lot? I would not have thought so but it is

I worked through a lot of the math associated with this sort of thing in
a series of columns on TCP/IP and network protocols in CMW over the last
4-5 months.  Measurements also help you understand things -- look into

 * lmbench: http://www.bitkeeper.com

 * netperf: http://www.netperf.org

 * netpipe: http://www.scl.ameslab.gov/Projects/NetPIPE/NetPIPE.html

as network testing/benchmark tools.  IIRC, netperf may actually be the
most relevant tool for you with its RR tests, but all of these tools
will measure packet latencies.

In one of those columns I present the results of measuring 100 BT
latency with all three tools, getting a number on the order of 150 usec.
The inverse of 1.50 x 10^-4 is 6666 packets per second, using relatively
old/slow hardware throughout, so 8000 pps is not at all unreasonable for
faster/more modern hardware.
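
The arithmetic here is just the reciprocal of the per-packet latency.  A
minimal sketch in Python (the function names are mine, purely for
illustration, and it assumes packets are handled strictly one after
another with no overlap):

```python
def packets_per_second(latency_seconds):
    """Back-to-back packet rate if each packet costs latency_seconds."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_seconds


def per_packet_budget_usec(rate_pps):
    """Time available per packet, in microseconds, at a given packet rate."""
    if rate_pps <= 0:
        raise ValueError("rate must be positive")
    return 1e6 / rate_pps


measured_latency = 150e-6  # ~150 usec, the 100BT figure quoted above
print(f"{packets_per_second(measured_latency):.1f} pps at 150 usec per packet")
print(f"{per_packet_budget_usec(8000):.1f} usec per packet at 8000 pps")
```

Read the other way around, 8000 pps leaves a budget of 125 usec per
packet, a little under the 150 usec measured on the older hardware.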

Now, you are not alone in looking into this.  I found:


which looks like it might be relevant to your efforts and maybe would
provide you with somebody to collaborate with (I was looking for a
description of H323, which is not a protocol I'm familiar with, making it
hard to know just what your limits are going to be).

> hammering our box. I have applied several of the applicable tuning
> suggestions (tcp stuff is not applicable since RTP is all UDP) from: 
> but the improvement has been minimal. We have some generic 100Mb ethernet

Not surprising.  TCP or UDP, if you want to end up with a reliable
transmission protocol, you have to include pretty much the same features
that are found in TCP anyway, and chances are excellent that unless you
really know what you are doing and work very hard, you'll end up with
something that is ultimately less efficient and/or reliable than TCP
anyway.  Besides, a goodly chunk of a latency hit is at the IP level and
protocol independent (the wire, the switch, the cards, the kernel
interface pre-TCP).

In fact, you might be better off running a TCP based protocol and using
one of the (relatively) new cards that support onboard TCP and RDMA.
That might offload a significant amount of the packet header processing
onto a NIC-based co-processor and spare your CPU from having to manage
all those interrupts.

> chipset in the box. I have seen a number of high performance computing
> guys talk about interrupt coalescence in mailing list archives found via
> google while researching this problem. Can the Pro 100 card do this or do
> I need the 1000? Does it seem likely that if I run down to the store and
> pay my $25 for an Intel Pro 1000 card and load up the driver with the
> InterruptThrottleRate set (and what is a good value for this?) that I will
> get dramatically improved performance? I would do it right now just to see
> but the box is in a colo a considerable drive away so I want to have a
> good idea that it will work before we make the drive. Ideally I would like
> to get 10x (76800pps) the performance out of the box but could settle for
> 5x (38400pps).
> Thanks for any tips you can provide!

Well, it does seem (if you are indeed using 100BT) that an obvious first
thing to try is to go to gigabit ethernet.  I don't think it is going to
get you out to where you want to go "easily" (or cheaply), but it might
get there.  For example, here is a pair of dual Opteron 242's using
gigabit ethernet in a netperf TCP RR test:

Testing with the following command line:
./netperf -l 60 -H s01 -t TCP_RR -i 10,3 -I 99,5 -- -r 1,1 -s 0 -S 0

TCP REQUEST/RESPONSE TEST to s01 : +/-2.5% @ 99% conf.
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate         
bytes  Bytes  bytes    bytes   secs.    per sec   

16384  87380  1        1       60.00    17431.26   
65536  262142

My dual Opterons (Penguin Altus 1000E's) have integrated dual gigabit
ethernet; it is not unlikely that they could sustain close to twice this
rate on both channels going flat out at the same time, which would get
you to your 35 Kpps just about exactly and would likely give you a bit
of change from your processor dollar so that you can actually do things
with the packet streams as well while upper/lower half handlers do their
thing.  However, in a real asynchronous environment where the packets
are not just streaming in, your performance will likely be somewhat
lower.
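
To put numbers on that estimate, here is a rough sketch in Python.  It
treats one netperf transaction as one packet, as I do above, and takes
the "twice this rate" figure as an optimistic assumption rather than a
measurement; the 5x and 10x targets are the ones from the original
question.

```python
measured_rate = 17431.26  # transactions/s from the TCP_RR run above

# Time per request/response transaction, in microseconds.
round_trip_usec = 1e6 / measured_rate

# Optimistic assumption: both gigabit channels sustain the full rate.
two_channel_estimate = 2 * measured_rate

targets = {"5x": 38400, "10x": 76800}  # pps goals from the question

print(f"per-transaction time: {round_trip_usec:.1f} usec")
print(f"two-channel estimate: {two_channel_estimate:.0f} per second")
for label, target in targets.items():
    fraction = two_channel_estimate / target
    print(f"{label} target {target} pps: estimate covers {fraction:.0%}")
```

Even in the best case the two-channel estimate falls a bit short of the
5x goal and reaches less than half of the 10x goal.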

Note also that your pps (latency) performance will gradually drop as the
packets themselves carry more than a single byte of data until they
reach the data/wirespeed bounds as opposed to the latency bounds.  I'm
seeing a 10-20% drop off in the TCP RR results as packet payload sizes
get closer to 100 bytes, and would expect them to drop to a rate
determined by a mix of the MTU selected and wirespeed as they get out to
the MTU and beyond in size.
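
The wirespeed bound itself is easy to estimate.  The sketch below (in
Python, with names of my own choosing) counts the standard Ethernet
framing overhead and assumes plain 20-byte IP and 20-byte TCP headers
with no options, which is an assumption and not something I measured.
It gives only the ceiling set by the wire, not the latency-bound rate
that netperf reports.

```python
# Per-frame Ethernet overhead in bytes: preamble + start delimiter (8),
# MAC header (14), frame check sequence (4), interframe gap (12).
ETHERNET_OVERHEAD = 8 + 14 + 4 + 12
MIN_FRAME_PAYLOAD = 46   # shorter payloads are padded up to this
IP_TCP_HEADERS = 20 + 20  # assumption: no IP or TCP options


def wire_limited_pps(data_bytes, link_bits_per_second,
                     l3l4_headers=IP_TCP_HEADERS):
    """Upper bound on packets per second set by link speed alone."""
    payload = max(data_bytes + l3l4_headers, MIN_FRAME_PAYLOAD)
    bits_on_wire = (payload + ETHERNET_OVERHEAD) * 8
    return link_bits_per_second / bits_on_wire


for size in (1, 100, 1460):
    fast = wire_limited_pps(size, 100e6)
    gig = wire_limited_pps(size, 1e9)
    print(f"{size:5d} bytes: {fast:10.0f} pps at 100 Mbps, "
          f"{gig:10.0f} pps at 1 Gbps")
```

For tiny packets the wire limit is far above anything measured here, so
latency dominates; only near MTU-sized packets does the wire itself
become the bound.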

From this it looks to me like you will have marginally acceptable
performance with gigabit ethernet, at best, although I >>am<< using a
relatively cheap gigE switch and there are likely switches out there
that cost more money that can deliver better switch latency.  However,
you'll also have the problem of partitioning your data stream onto two
switches, and this may or may not be terribly easy.

This suggests that you look into faster networks.  You haven't mentioned
the actual context of the conversion -- how it is being fed a packet
stream, where the output packet stream goes.  This seems to me to be as
much of an issue as "the box" that does the actual conversion.  The same
limits are going to be in place at ALL LEVELS of the up/down stream
networks -- a single host is only going to be able to feed your
conversion box at MOST at the rates you measure for an ideal connection
at the network you eventually select, very likely degraded, possibly
SIGNIFICANTLY degraded, by asynchronous contention for the resource if
you are hammering the conversion box with switched packet streams from
twenty or thirty hosts at once.

You might actually need to consider an architecture where several hosts
accept those incoming packet streams (providing a "high availability"
type interface to the outside world, where traffic to the "conversion
host" is dynamically rerouted to one of a small farm of conversion
servers) and then either distribute the conversion process (probably
smartest) or use a faster (lower latency) network to funnel the
traffic back to a single conversion host.  This is presuming that you
can't already use a faster network between the sources of the conversion
stream and the conversion host, which seems unlikely unless it is
already embedded in an architecture de facto "like" this one.
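
As a very rough illustration of the "small farm" idea, the front-end
hosts could pin each voice stream to one conversion server by hashing
some stream identifier.  Everything in this Python sketch is
hypothetical: the host names, the choice of key, and the function
itself.  It is plain modulo hashing, so adding or removing a server
remaps many streams, and it does nothing about failover.

```python
import hashlib


def pick_conversion_host(stream_key, hosts):
    """Map a stream identifier to one host, the same one every time."""
    if not hosts:
        raise ValueError("need at least one conversion host")
    digest = hashlib.sha256(stream_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(hosts)
    return hosts[index]


# Hypothetical farm of conversion servers.
farm = ["conv01", "conv02", "conv03", "conv04"]

# Hypothetical stream keys: source address and RTP port.
for key in ("10.0.0.5:16384", "10.0.0.5:16386", "10.0.0.9:16384"):
    print(key, "->", pick_conversion_host(key, farm))
```

Keeping all packets of one stream on one server means no conversion
state has to be shared between servers.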

Hope this helps.


> -- 
> Tracy Reed    http://copilotcom.com 
> This message is cryptographically signed for your protection.
> Info: http://copilotconsulting.com/sig

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
