[Beowulf] [Serguei.Osokine at efi.com: RE: [p2p-hackers] MTU in the real world]

Eugen Leitl eugen at leitl.org
Tue May 31 14:07:23 PDT 2005


----- Forwarded message from Serguei Osokine <Serguei.Osokine at efi.com> -----

From: Serguei Osokine <Serguei.Osokine at efi.com>
Date: Tue, 31 May 2005 10:18:30 -0700
To: "Peer-to-peer development." <p2p-hackers at zgp.org>
Subject: RE: [p2p-hackers] MTU in the real world
Reply-To: "Peer-to-peer development." <p2p-hackers at zgp.org>

On Tuesday, May 31, 2005 David Barrett wrote:
> With this in mind, have you tried using an MTU bigger than 1500 bytes
> and been bitten by it?

	Yes. That was not your typical everyday situation, but I think
some on this list might find it entertaining anyway:

	We tried to use UDP to transfer stuff over a gigabit LAN inside
the cluster. Pretty soon we discovered that with small (~1500-byte)
packets the CPU was the bottleneck, because you can send only so many
packets per second, and the resulting throughput was nowhere close to
a gigabit. (You have to send almost 100K such packets a second to
achieve gigabit throughput, and we were doing several times fewer than
that on our 2-CPU 2.4GHz Win XP boxes.)
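
	For a rough sense of that arithmetic, here is a small
back-of-the-envelope calculation in Python; the 1 Gbit/s and 1500-byte
figures are the ones from the paragraph above, the rest is plain
arithmetic:

    # Back-of-the-envelope packet-rate math for a gigabit link
    # carrying ~1500-byte packets.
    LINK_BITS_PER_SEC = 1_000_000_000
    PACKET_BYTES = 1500

    packets_per_sec = LINK_BITS_PER_SEC / (PACKET_BYTES * 8)
    print(f"{packets_per_sec:,.0f} packets per second")  # ~83,333 - "almost 100K"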

	So then we tried to increase the UDP datagram size. The gigabit
switch did not support jumbo frames, by the way, so we were fragmenting
as soon as we exceeded 1500 bytes. The throughput went up, and was pretty
decent with 64-KB datagrams (I don't remember the exact numbers, but it
was close to a gigabit and generally everything was peachy).
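
	A minimal sketch of that kind of sender, assuming a hypothetical
receiver address and a datagram size just under the 65,507-byte UDP/IPv4
payload limit (neither is from the original post):

    import socket

    # Hypothetical receiver address - a placeholder, not from the post.
    DEST = ("192.0.2.10", 9000)
    DATAGRAM_SIZE = 64 * 1024 - 1024   # ~63 KB, under the 65,507-byte limit

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\0" * DATAGRAM_SIZE

    # Each sendto() hands the whole datagram to the kernel, which splits
    # it into ~1500-byte IP fragments because the NIC/switch MTU is 1500.
    for _ in range(1000):
        sock.sendto(payload, DEST)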

	Which is when the funny things started to happen. In the middle
of a test, the communication channel would just shut down and nothing
would be delivered over it for a minute or two (though both the sender
and the receiver kept looking fine and no errors were returned by the
socket calls - the sender was sending data, but the receiver's recvfrom()
call was not getting it). After that pause the channel would wake up
as if nothing had happened (except for several gigabytes of lost data),
work normally for a few minutes, after which this shutdown would be
repeated, and so on.
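
	One way to make that kind of silent stall visible, rather than
blocking forever in recvfrom(), is a receive timeout; a small
illustrative sketch (the port, buffer size and timeout are arbitrary
choices, not from the post):

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 9000))   # arbitrary port for illustration
    sock.settimeout(5.0)           # surface a stall instead of blocking forever

    while True:
        try:
            data, addr = sock.recvfrom(65536)
        except socket.timeout:
            print("no datagrams for 5 s - channel stalled?")
            continue
        # ...process data...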

	Took us a while to figure out what was going on, but here is the
scoop: the gigabit LAN had a fairly small, but nonetheless non-zero,
packet loss rate. When one 1500-byte frame from a 64-KB datagram is
lost, the rest of the datagram's frames (all ~62 KB) have to be buffered
somewhere in case the missing frame arrives and the datagram can be
fully reassembled. This arrival will never happen, but the socket
layer does not know that, so it has to keep the partial datagram for
a while, discarding all its frames if the missing frame does not arrive
before some timeout (RFC 1122 recommends a reassembly timeout of
between 60 and 120 seconds, and this seems to be in line with what
we saw).
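
	Fragmentation also amplifies the effective loss rate, since losing
any one fragment dooms the whole datagram. A rough calculation, assuming
~1480 bytes of IP payload per 1500-byte frame and the 0.01% per-frame
loss rate used below:

    # Probability that a 64-KB datagram is reassembled, assuming
    # independent per-fragment loss and ~1480 bytes of payload per frame.
    DATAGRAM_BYTES = 64 * 1024
    FRAGMENT_PAYLOAD = 1480
    LOSS_RATE = 0.0001                 # 0.01% per frame

    fragments = -(-DATAGRAM_BYTES // FRAGMENT_PAYLOAD)   # ceil -> ~45
    p_delivered = (1 - LOSS_RATE) ** fragments
    print(f"{fragments} fragments, P(datagram delivered) = {p_delivered:.4f}")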

	Now, the gigabit link sends quite a lot of data - 100 MB+ per
second. Even with a 0.01% loss rate, you're losing about 10,000 bytes
per second. That is no big deal by itself, but every 1500 bytes lost
causes you to store ~62 KB of partial datagrams, so with the loss rate
above you have to store about 400 KB of new data every second. If this
data expires in 120 seconds, you need about 50 MB for the partial
datagram storage in the socket layer - and proportionally more if your
data loss rate is higher than 0.01%. And this amount of memory is
something that the socket layer in Win XP simply does not have. So as
soon as it runs out of memory for the partially assembled datagrams, it
stops the data delivery and waits for the memory to be released.
Apparently once it gets enough free memory, it switches the data
delivery back on again.
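
	Spelled out, the arithmetic in the paragraph above (all figures
are the ones quoted there):

    # Reassembly-buffer buildup from the figures in the post.
    LINK_BYTES_PER_SEC = 100_000_000    # "100 MB+ per second"
    LOSS_RATE = 0.0001                  # 0.01%
    FRAME_BYTES = 1500
    PARTIAL_DATAGRAM_BYTES = 62 * 1024  # buffered per lost frame
    REASSEMBLY_TIMEOUT_SEC = 120        # upper end of the RFC 1122 range

    lost_bytes_per_sec = LINK_BYTES_PER_SEC * LOSS_RATE       # ~10,000 B/s
    lost_frames_per_sec = lost_bytes_per_sec / FRAME_BYTES    # ~6.7 frames/s
    buffered_per_sec = lost_frames_per_sec * PARTIAL_DATAGRAM_BYTES
    peak_buffered = buffered_per_sec * REASSEMBLY_TIMEOUT_SEC
    print(f"~{buffered_per_sec/1024:.0f} KB/s buffered, "
          f"~{peak_buffered/2**20:.0f} MB held at the {REASSEMBLY_TIMEOUT_SEC}s timeout")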

	This approach does seem funny, and I don't see any compelling
reason for the socket layer to handle that situation in this "trigger"
fashion - either it works normally, or it shuts down the data delivery
completely. It might have handled this a bit more gracefully, I'd think.
But this was Windows, and there was no arguing with it. (We were stuck
with Windows for unrelated reasons.)

	So the bottom line was, we had to go with TCP, because there was
no way we could make a UDP transport that would be both fast enough
and would work on our hardware/OS combination. And the "would work"
part was definitely related to the attempt to send datagrams that
exceeded the MTU. (Datagrams smaller than the MTU sucked performance-
wise when compared to TCP, but that is another story - gigabit cards
tend to offload plenty of TCP functionality from the CPU, so it was
not that UDP was particularly bad, but rather that TCP performance
was very good.)

	Best wishes -
	S.Osokine.
	31 May 2005.

-----Original Message-----
From: p2p-hackers-bounces at zgp.org [mailto:p2p-hackers-bounces at zgp.org]On
Behalf Of David Barrett
Sent: Tuesday, May 31, 2005 3:11 AM
To: Peer-to-peer development.
Subject: [p2p-hackers] MTU in the real world


I've read in multiple places that it's best to have a UDP MTU of under 
1500 bytes.  However, it sounds like most of this is based on 
theoretical analysis, and not on real-world experience.

With this in mind, have you tried using an MTU bigger than 1500 bytes and 
been bitten by it?  Basically, do you know of any empirical analysis (of 
any level of formality) of a real-world UDP application that supports or 
refutes the 1500-byte rule of thumb?

Furthermore, I've read that if you "connect" your UDP socket to the 
remote side and then start sending large packets and backing off slowly, 
the socket layer will compute the "real" MTU between the two endpoints, and 
you can obtain it through "getsockopt".  Do you know of anyone who's 
tried this, and what the results were?
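
For reference, on Linux this technique looks roughly like the sketch
below; the option names are Linux-specific, the constants are guarded in
case the Python socket module does not expose them, and the peer address
is a placeholder. Windows exposes path-MTU information differently, so
this is only an illustration of the approach described above:

    import socket

    # Linux socket options for path-MTU discovery; fall back to the
    # usual Linux values if the module does not define them.
    IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
    IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)
    IP_MTU = getattr(socket, "IP_MTU", 14)

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, IP_PMTUDISC_DO)
    sock.connect(("192.0.2.10", 9000))   # placeholder peer address

    # With "do" path-MTU discovery, a send larger than the current
    # estimate fails with EMSGSIZE (the DF bit is set), and the kernel's
    # estimate can then be read back on the connected socket.
    try:
        sock.send(b"\0" * 2000)
    except OSError:
        pass
    mtu = sock.getsockopt(socket.IPPROTO_IP, IP_MTU)
    print("path MTU estimate:", mtu)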

-david
_______________________________________________
p2p-hackers mailing list
p2p-hackers at zgp.org
http://zgp.org/mailman/listinfo/p2p-hackers
_______________________________________________
Here is a web page listing P2P Conferences:
http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07100, 11.36820            http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A  7779 75B0 2443 8B29 F6BE