[Beowulf] [Serguei.Osokine@efi.com: RE: [p2p-hackers] MTU in the real world]
Eugen Leitl eugen at leitl.org
Tue May 31 14:07:23 PDT 2005
----- Forwarded message from Serguei Osokine <Serguei.Osokine at efi.com> -----

From: Serguei Osokine <Serguei.Osokine at efi.com>
Date: Tue, 31 May 2005 10:18:30 -0700
To: "Peer-to-peer development." <p2p-hackers at zgp.org>
Subject: RE: [p2p-hackers] MTU in the real world
Reply-To: "Peer-to-peer development." <p2p-hackers at zgp.org>

On Tuesday, May 31, 2005 David Barrett wrote:
> With this in mind, have you tried using a MTU bigger than 1500 bytes
> and been bitten by it?

Yes. That was not your typical everyday situation, but I think some on this list might find it entertaining anyway:

We tried to use UDP to transfer stuff over a gigabit LAN inside the cluster. Pretty soon we discovered that with small (~1500 byte) packets the CPU was the bottleneck, because you can send only so many packets per second, and the resulting throughput was nowhere close to a gigabit. (You have to send almost 100K such packets a second to achieve a gigabit throughput, and we were doing several times less on our 2-CPU 2.4GHz Win XP boxes.)

So then we tried to increase the UDP datagram size. The gigabit switch did not support jumbo frames, by the way, so we were fragmenting as soon as we exceeded 1500. The throughput went up, and was pretty decent with 64-KB datagrams (don't remember the exact numbers, but it was close to a gigabit and generally everything was peachy).

Which is when the funny things started to happen. In the middle of a test, the communication channel would just shut down and nothing would be delivered over it for a minute or two (though both the sender and the receiver kept looking fine and no errors were returned by the socket calls - the sender was sending data, but the receiver's recvfrom() call was not getting it); after that pause the channel would wake up as if nothing happened (except for several gigabytes of lost data), work normally for a few minutes, after which this shutdown would be repeated, and so on.
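The packet-rate figure quoted above can be checked with a few lines of arithmetic. This is only a rough sketch, not part of the original message: it ignores Ethernet, IP and UDP header overhead, and the function name and the 65,507-byte figure (the largest UDP payload that fits in an IPv4 datagram) are assumptions chosen for illustration. It puts the rate for 1500-byte packets at roughly 83,000 per second, the same order as the "almost 100K" above, and shows how far the per-call rate drops with maximum-size datagrams.

```python
# Rough packets-per-second needed to fill a link with fixed-size packets.
# Assumption: header overhead (Ethernet/IP/UDP) is ignored for simplicity.

def packets_per_second(link_bits_per_sec: float, packet_bytes: int) -> float:
    """Number of packets of the given size needed each second to fill the link."""
    return link_bits_per_sec / (packet_bytes * 8)

GIGABIT = 1_000_000_000          # 1 Gbit/s link
SMALL_PACKET = 1500              # bytes, roughly one Ethernet frame
MAX_UDP_PAYLOAD = 65_507         # bytes: 65535 - 20 (IPv4 header) - 8 (UDP header)

pps_small = packets_per_second(GIGABIT, SMALL_PACKET)
pps_large = packets_per_second(GIGABIT, MAX_UDP_PAYLOAD)

print(f"1500-byte packets:  {pps_small:,.0f} sends per second")
print(f"64-KB datagrams:    {pps_large:,.0f} sends per second")
print(f"reduction in sends: {pps_small / pps_large:.1f}x")
```

The number of frames on the wire is about the same in both cases, since the large datagrams are fragmented into 1500-byte frames; what drops is the number of send calls the application and socket layer have to process.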
Took us a while to figure out what was going on, but here is the scoop: the gigabit LAN had a fairly small, but nonetheless non-zero packet loss rate. When one 1500-byte frame from a 64-KB datagram is lost, the rest of the datagram frames (all 62 KB) have to be buffered somewhere in case the missing frame arrives and the datagram can be fully reassembled. This arrival will never happen, but the socket layer does not know that, so it has to keep the partial datagram for a while, discarding all its frames if the missing frame does not arrive before some timeout (RFC 1122 recommends this timeout value to be between 60 and 120 seconds, and this seems to be in line with what we saw).

Now, the gigabit link sends quite a lot of data - 100MB+ per second, to be precise. Even with 0.01% loss rate, you're losing about 10,000 bytes per second. This is no big deal, but every 1500 bytes lost cause you to store 62 KB of partial datagrams, so with the loss rate above you have to store 400 KB of new data every second. If this data expires in 120 seconds, you need about 50 MB for the partial datagram storage in the socket layer - and proportionally more if your data loss rate is higher than 0.01%.

And this amount of memory is something that the socket layer in Win XP simply does not have. So as soon as it runs out of memory for the partially assembled datagrams, it stops the data delivery and waits for the memory to be released. Apparently after it gets enough free memory, it switches the data delivery back on again.

This approach does seem funny, and I don't see any compelling reason for the socket layer to handle that situation in this "trigger" fashion - either it works normally, or shuts down the data delivery completely. Might have handled this a bit more gracefully, I'd think. But this was Windows, and there was no arguing with it. (We were stuck with Windows for unrelated reasons.)
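The buffer estimate above can be reproduced as a short calculation. This is a sketch under stated assumptions rather than a model of what the Windows stack actually does: the function name is made up for illustration, losses are assumed rare enough that each lost frame strands a different datagram, and the inputs are the round numbers from the message (100 MB/s, 0.01% loss, 1500-byte frames, 64-KB datagrams, 120-second timeout).

```python
# Estimate of memory tied up in partially reassembled datagrams.
# Assumption: each lost frame hits a different datagram (true at low loss rates),
# so every loss strands the remaining frames of one datagram until the timeout.

def reassembly_backlog(throughput_bytes_per_sec: float,
                       loss_rate: float,
                       frame_bytes: int,
                       datagram_bytes: int,
                       timeout_sec: float) -> tuple[float, float]:
    """Return (bytes stranded per second, steady-state bytes held until timeout)."""
    lost_bytes_per_sec = throughput_bytes_per_sec * loss_rate
    lost_frames_per_sec = lost_bytes_per_sec / frame_bytes
    stranded_per_loss = datagram_bytes - frame_bytes   # the frames that did arrive
    growth_per_sec = lost_frames_per_sec * stranded_per_loss
    return growth_per_sec, growth_per_sec * timeout_sec

growth, backlog = reassembly_backlog(
    throughput_bytes_per_sec=100e6,   # ~100 MB/s on a gigabit link
    loss_rate=0.0001,                 # 0.01% loss
    frame_bytes=1500,
    datagram_bytes=64 * 1024,
    timeout_sec=120,                  # upper end of the RFC 1122 range
)

print(f"stranded per second: {growth / 1e3:.0f} KB")
print(f"held at steady state: {backlog / 1e6:.0f} MB")
```

With these inputs the result lands in the same range as the figures in the message, roughly 400 KB of new partial data per second and about 50 MB held at any time. Since the backlog scales linearly with both the loss rate and the timeout, a 0.1% loss rate would need ten times as much.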
So the bottom line was, we had to go with TCP, because there was no way we could make a UDP transport that would be both fast enough and would work on our hardware/OS combination. And the part about "would work" was definitely related to an attempt to send datagrams that would exceed the MTU. (Datagrams smaller than the MTU sucked performance-wise when compared to TCP, but that is another story - gigabit cards tend to offload plenty of TCP functionality from the CPU, so it was not that UDP was particularly bad, but rather that TCP performance was very good.)

Best wishes -
S.Osokine.
31 May 2005.

-----Original Message-----
From: p2p-hackers-bounces at zgp.org [mailto:p2p-hackers-bounces at zgp.org] On Behalf Of David Barrett
Sent: Tuesday, May 31, 2005 3:11 AM
To: Peer-to-peer development.
Subject: [p2p-hackers] MTU in the real world

I've read in multiple places that it's best to have a UDP MTU of under 1500 bytes. However, it sounds like most of this is based on theoretical analysis, and not on real-world experience.

With this in mind, have you tried using a MTU bigger than 1500 bytes and been bitten by it? Basically, do you know of any empirical analysis (of any level of formality) of a real-world UDP application that supports or refutes the 1500 byte rule of thumb?

Furthermore, I've read that if you "connect" your UDP socket to the remote side and then start sending large packets and backing off slowly, the socket layer will compute the "real" MTU between two endpoints, and you can obtain it through "getsockopt". Do you know of anyone who's tried this, and the results?
-david

_______________________________________________
p2p-hackers mailing list
p2p-hackers at zgp.org
http://zgp.org/mailman/listinfo/p2p-hackers
_______________________________________________
Here is a web page listing P2P Conferences:
http://www.neurogrid.net/twiki/bin/view/Main/PeerToPeerConferences

----- End forwarded message -----
-- 
Eugen* Leitl <a href="http://leitl.org">leitl</a>
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050531/9d92b258/attachment.bin
