[Beowulf] tcp error: Need ideas!

Gerry Creager gerry.creager at tamu.edu
Sun Jan 25 07:13:24 PST 2009


Joe Landman wrote:
> I wonder if the switch could be implicated.  We have seen some (cheap) 
> GbE switches not support (in practice) jumbo frames (irrespective of 
> literature).

Been there, done that.  HP claims to be able to handle packets up to 
9000 bytes of payload. (9122 total, IIRC)

> Nifty Tom Mitchell wrote:
>> On Sat, Jan 24, 2009 at 09:36:09AM -0600, Gerry Creager wrote:
>>> Couple of follow-up notes.
>>>
>>> MTU=4500:  Had one node fall over with the same overflow errors.
>>> MTU=3000:  A WRF model is running, but single timesteps are 
>>> executing 2.5x slower than at MTU=1500
> 
> Segment offload?  Is TSO on or off?

On.

>     ethtool -k eth0
> 
> will tell you.  You might also have one very reluctant machine, in the 
> sense of being unwilling to switch its MTU.  Could you do an

-bash-3.2# ethtool -K rx off
no offload settings changed
-bash-3.2# ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

But here's the one I love:
-bash-3.2# ethtool -K tso off
no offload settings changed

I apparently can't control things with ethtool...
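
A note on the two "ethtool -K" attempts above: both omit the interface
name. The documented form is "ethtool -K DEVNAME FEATURE on|off", so "rx"
and "tso" were most likely taken as the device name. That may be why nothing
changed, although the driver could still refuse the change for its own
reasons. A minimal sketch, assuming eth1 is the interface in question (as in
the "-k" query above); the helper name set_offload and the DRY_RUN switch
are made up for illustration:

```shell
#!/bin/sh
# Sketch only: "eth1" and the helper name set_offload are illustrative.
# ethtool -K expects: ethtool -K DEVNAME FEATURE on|off
set_offload() {
    dev=$1; feature=$2; state=$3
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it (no root needed).
        echo "ethtool -K $dev $feature $state"
    else
        ethtool -K "$dev" "$feature" "$state"
    fi
}

# Usage (as root):
#   set_offload eth1 tso off
#   ethtool -k eth1      # lower-case -k shows the resulting settings
```

Re-running the lower-case "-k" query afterwards is the quickest way to see
whether the setting actually took.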

>     ifconfig eth0 | grep MTU

Thought of that.  All change appropriately.

> on each machine and verify that everyone is using the right MTU?
> 
> 
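
The per-machine MTU check asked about above can be scripted rather than done
by hand. A rough sketch, assuming passwordless ssh to the nodes and that eth0
is the interface of interest; the function name mtu_mismatches and the node
names are hypothetical:

```shell
#!/bin/sh
# Sketch only: mtu_mismatches and the host names below are hypothetical.
# Reads "host mtu" pairs on stdin, prints hosts whose MTU differs from $1.
mtu_mismatches() {
    expected=$1
    while read -r host mtu; do
        [ "$mtu" = "$expected" ] || echo "$host: MTU $mtu (expected $expected)"
    done
}

# Usage, assuming passwordless ssh and that eth0 is the interface in question:
#   for h in node01 node02 node03; do
#       echo "$h $(ssh "$h" cat /sys/class/net/eth0/mtu)"
#   done | mtu_mismatches 9000
```

No output means every listed node reports the expected MTU.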
>>>
>>> I'll go snag the new driver and compile it.  After all: What can it 
>>> hurt!
>>>
>>> Thanks, Guy!
>>>
>>> Regards, Gerry
>>>
>>> Guy Coates wrote:
>>>> Hi,
>>>>
>>>> We have also seen problems with the bnx2 drivers.
>>>>
>>>> I got a more recent set of bnx2 drivers from Broadcom:
>>>>
>> ......
>>
>> Has the traffic been snooped to see if all
>> is as expected?
>>
>> If you are seeing the natural MTU (1500) run faster than a jumbo MTU,
>> then something is fragmenting the data or causing it to be fragmented.
>> If MTU=4500 causes overflow errors, that might also be related to
>> fragmentation.
>> Both the sender and receiver have to keep all the bits of a reliable
>> transfer until the data has been acknowledged.  At one time,
>> fragmentation could only be done once, to a minimum MTU, in the life
>> of a packet.
>>
>> In addition to snooping packets try "tracepath" to and from all the 
>> involved boxes to discover what is going on.
>>
>>
> 
> 
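
The "tracepath" suggestion above can be paired with a full-size ping that is
not allowed to fragment, which shows directly whether a jumbo frame survives
the path. A sketch under stated assumptions: IPv4 with no IP options (so 28
bytes of IP plus ICMP header), the iputils versions of ping and tracepath,
and helper names (icmp_payload, probe_path) invented for illustration:

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; assumes IPv4 with no IP options.
# Largest ICMP echo payload that fits in one packet of the given MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

# Send one unfragmentable ping at full size, then show the path MTU.
probe_path() {
    host=$1; mtu=$2
    ping -c 1 -M do -s "$(icmp_payload "$mtu")" "$host"
    tracepath "$host"
}

# Usage: probe_path node02 9000
```

If the ping fails with a "message too long" style error while smaller sizes
succeed, something on the path (NIC, driver, or switch) is not passing
frames of that size.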

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843


