[Beowulf] Small packets causing context switch thrashing?
Robert G. Brown rgb at phy.duke.edu
Thu Dec 2 07:08:03 PST 2004
On Wed, 1 Dec 2004, Tracy R Reed wrote:

> Ok, this is not exactly beowulf or supercomputer related but it is
> definitely a form of high performance computing and I am hoping
> the beowulf community has applicable experience.
>
> I am building a box to convert VOIP traffic from H323 to SIP. The system
> is an AMD64. Both of these protocols use RTP to transmit the voice data
> which means many many small packets. We are currently looking at 8000
> packets per second due to 96 simultaneous voice channels and the box is
> already at 50% cpu. I really think this box should be able to handle a lot
> more than this. I have seen people talk about proxying 2000 RTP streams on
> a P4. We get around 15,000 context switches and 8000 interrupts per second
> and the box is heavily loaded and the load average starts going up. Is
> 8000 packets per second a lot? I would not have thought so but it is

I worked through a lot of the math associated with this sort of thing in
a series of columns on TCP/IP and network protocols in CMW over the last
4-5 months. Measurements also help you understand things -- look into

  * lmbench: http://www.bitkeeper.com
  * netperf: http://www.netperf.org
  * netpipe: http://www.scl.ameslab.gov/Projects/NetPIPE/NetPIPE.html

as network testing/benchmark tools. IIRC, netperf may actually be the
most relevant tool for you with its RR tests, but all of these tools
will measure packet latencies.

In one of those columns I present the results of measuring 100BT
latency with all three tools, getting a number on the order of 150 usec.
The inverse of 1.50 x 10^-4 seconds is about 6666 packets per second,
using relatively old/slow hardware throughout, so 8000 pps is not at all
unreasonable for faster/more modern hardware.

Now, you are not alone in looking into this.
I found:

  www.cs.columbia.edu/~dutta/research/sip-ipv6.pdf

which looks like it might be relevant to your efforts and maybe would
provide you with somebody to collaborate with (I was looking for a
description of H323, which is not a protocol I'm familiar with, making
it hard to know just what your limits are going to be).

> hammering our box. I have applied several of the applicable tuning
> suggestions (tcp stuff is not applicable since RTP is all UDP) from:
>
> http://216.239.57.104/search?q=cache:0VItqrkQdO0J:datatag.web.cern.ch/datatag/howto/tcp.html+linux+maximum+network+buffer+size&hl=en
>
> but the improvement has been minimal. We have some generic 100Mb ethernet

Not surprising. TCP or UDP, if you want to end up with a reliable
transmission protocol, you have to include pretty much the same features
that are found in TCP anyway, and chances are excellent that unless you
really know what you are doing and work very hard, you'll end up with
something that is ultimately less efficient and/or reliable than TCP.
Besides, a goodly chunk of the latency hit is at the IP level and
protocol independent (the wire, the switch, the cards, the kernel
interface pre-TCP).

In fact, you might be better off running a TCP based protocol and using
one of the (relatively) new cards that support onboard TCP and RDMA.
That might offload a significant amount of the packet header processing
onto a NIC-based co-processor and spare your CPU from having to manage
all those interrupts.

> chipset in the box. I have seen a number of high performance computing
> guys talk about interrupt coalescence in mailing list archives found via
> google while researching this problem. Can the Pro 100 card do this or do
> I need the 1000? Does it seem likely that if I run down to the store and
> pay my $25 for an Intel Pro 1000 card and load up the driver with the
> InterruptThrottleRate set (and what is a good value for this?) that I will
> get dramatically improved performance?
> I would do it right now just to see
> but the box is in a colo a considerable drive away so I want to have a
> good idea that it will work before we make the drive. Ideally I would like
> to get 10x (76800pps) the performance out of the box but could settle for
> 5x (38400pps).
>
> Thanks for any tips you can provide!

Well, it does seem (if you are indeed using 100BT) that an obvious first
thing to try is to go to gigabit ethernet. I don't think it is going to
get you out to where you want to go "easily" (or cheaply), but it might
get there. For example, here is a pair of dual Opteron 242's using
gigabit ethernet in a netperf TCP RR test:

Testing with the following command line:
  ./netperf -l 60 -H s01 -t TCP_RR -i 10,3 -I 99,5 -- -r 1,1 -s 0 -S 0

  TCP REQUEST/RESPONSE TEST to s01 : +/-2.5% @ 99% conf.
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size     Size    Time     Rate
  bytes  Bytes  bytes    bytes   secs.    per sec

  16384  87380  1        1       60.00    17431.26
  65536  262142

My dual Opterons (Penguin Altus 1000E's) have integrated dual gigabit
ethernet; it is not unlikely that they could sustain close to twice this
rate with both channels going flat out at the same time, which would get
you to roughly 35 Kpps, close to your 5x target, and would likely give
you a bit of change from your processor dollar so that you can actually
do things with the packet streams as well while the upper/lower half
handlers do their thing.

However, in a real asynchronous environment where the packets are not
just streaming in, your performance will likely be somewhat lower. Note
also that your pps (latency) performance will gradually drop as the
packets themselves carry more than a single byte of data, until they
reach the data/wirespeed bounds as opposed to the latency bounds.
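As a rough sketch of the arithmetic so far (Python; the helper names are made up for illustration and are not part of any tool mentioned here). It treats one netperf transaction as one packet, the same simplification used in the estimate above, and assumes that two gigabit channels scale linearly, which is optimistic:

```python
# Back-of-the-envelope packet-rate arithmetic with figures quoted in this thread.
# Helper names are hypothetical; this is a sketch, not a benchmark.

def rate_from_latency(latency_sec):
    """Upper bound on strictly serialized packets per second at a given latency."""
    return 1.0 / latency_sec

def headroom(achievable_pps, target_pps):
    """Ratio of estimated rate to target; below 1.0 means the target is missed."""
    return achievable_pps / target_pps

latency_100bt = 150e-6            # ~150 usec measured on 100BT
gige_rr_rate = 17431.26           # netperf TCP_RR rate on one gigE channel
dual_channel = 2 * gige_rr_rate   # ASSUMPTION: both channels scale linearly

targets = {"5x": 38400, "10x": 76800}   # targets stated in the question

print(f"100BT latency bound: {rate_from_latency(latency_100bt):.0f} pps")
print(f"dual gigE estimate:  {dual_channel:.0f} pps")
for name, pps in targets.items():
    print(f"{name} target ({pps} pps): headroom {headroom(dual_channel, pps):.2f}")
```

The headroom figure comes out a little under 1 for the 5x target and under one half for the 10x target, before any allowance for larger payloads or asynchronous contention.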
I'm seeing a 10-20% drop off in the TCP RR results as packet payload
sizes get closer to 100 bytes, and would expect them to drop to a rate
determined by a mix of the MTU selected and wirespeed as they get out to
the MTU and beyond in size.

From this it looks to me like you will have marginally acceptable
performance with gigabit ethernet, at best, although I >>am<< using a
relatively cheap gigE switch and there are likely switches out there
that cost more money that can deliver better switch latency. However,
you'll also have the problem of partitioning your data stream onto two
switches, and this may or may not be terribly easy.

This suggests that you look into faster networks. You haven't mentioned
the actual context of the conversion -- how it is being fed a packet
stream, where the output packet stream goes. This seems to me to be as
much of an issue as "the box" that does the actual conversion. The same
limits are going to be in place at ALL LEVELS of the up/down stream
networks -- a single host is only going to be able to feed your
conversion box at MOST at the rates you measure for an ideal connection
on the network you eventually select, very likely degraded, possibly
SIGNIFICANTLY degraded, by asynchronous contention for the resource if
you are hammering the conversion box with switched packet streams from
twenty or thirty hosts at once.

You might actually need to consider an architecture where several hosts
accept those incoming packet streams (providing a "high availability"
type interface to the outside world, where traffic to the "conversion
host" is dynamically rerouted to one of a small farm of conversion
servers) and then either distribute the conversion process (probably
smartest) or use a faster (lower latency) network to funnel the traffic
back to a single conversion host.
This is presuming that you can't already use a faster network between
the sources of the conversion stream and the conversion host, which
seems unlikely unless it is already embedded in an architecture de facto
"like" this one.

Hope this helps.

   rgb

>
> --
> Tracy Reed http://copilotcom.com
> This message is cryptographically signed for your protection.
> Info: http://copilotconsulting.com/sig
>

Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
