[Beowulf] tcp error: Need ideas!

Gerry Creager gerry.creager at tamu.edu
Sun Jan 25 06:55:41 PST 2009


We've run some forensics with a real test set.  It's not the HP ProCurve 
switch.

We've also seen good jumbo results with some of the managed Linksys 
48-port gigabit switches.

In other words, it's not the switch.  I tend to "think out loud" to 
expose all possible failure modes, a process I learned at NASA/Johnson 
when I worked on Space Station's Medical Operations.  In manned 
spaceflight, you have one exercise where you sit around and try to 
determine everything that could possibly go wrong, and how such a 
failure would manifest itself.  That tends to be useful in other 
operations, too.

Paulo Afonso Lopes wrote:
>> I wonder if the switch could be implicated.  We have seen some (cheap)
>> GbE switches not support (in practice) jumbo frames (irrespective of
>> literature).
> 
> I got the SMC 8624T because it advertised both Jumbo and link aggregation.
> Is this one of the "cheap" switches you have seen that does not work with Jumbo?
> 
> paulo
> 
> 
>> Nifty Tom Mitchell wrote:
>>> On Sat, Jan 24, 2009 at 09:36:09AM -0600, Gerry Creager wrote:
>>>> Couple of follow-up notes.
>>>>
>>>> MTU=4500:  Had one node fall over with the same overflow errors.
>>>> MTU=3000:  A WRF model is running, but single timesteps are executing
>>>> 2.5x slower than with MTU=1500
>> Segmentation offload?  Is TSO on or off?
>>
>> 	ethtool -k eth0
>>
>> will tell you.  You might also have one very reluctant machine, in the
>> sense of being unwilling to switch its MTU.  Could you do an
>>
>> 	ifconfig eth0 | grep MTU
>>
>> on each machine and verify that everyone is using the right MTU?
>>
>>
>>>> I'll go snag the new driver and compile it.  After all: What can it
>>>> hurt!
>>>>
>>>> Thanks, Guy!
>>>>
>>>> Regards, Gerry
>>>>
>>>> Guy Coates wrote:
>>>>> Hi,
>>>>>
>>>>> We have also seen problems with the bnx2 drivers.
>>>>>
>>>>> I got a more recent set of bnx2 drivers from Broadcom:
>>>>>
>>> ......
>>>
>>> Have the packets been snooped to see whether all
>>> is as expected?
>>>
>>> If you are seeing a natural MTU running faster than a jumbo MTU
>>> then something is fragmenting or causing fragmentation of the data.
>>>
>>> If MTU=4500 causes overflow errors, it might be related to
>>> fragmentation.
>>> Both the sender and receiver have to keep all the bits on a reliable
>>> transfer until the data has been acknowledged.   At one time
>>> fragmentation could only be done once to a minimum MTU in the life
>>> of a packet.
>>>
>>> In addition to snooping packets try "tracepath" to and from all
>>> the involved boxes to discover what is going on.
>>>
>>>
>>
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman at scalableinformatics.com
>> web  : http://www.scalableinformatics.com
>>         http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423 x121
>> fax  : +1 866 888 3112
>> cell : +1 734 612 4615
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
> 
> 
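
For the per-node MTU check suggested in the quoted text, here is a
minimal sketch, assuming passwordless ssh to every node and a Linux
sysfs that exposes /sys/class/net/<iface>/mtu.  The node names, the
interface name and the expected MTU are placeholders, not values taken
from our cluster.

```shell
#!/bin/sh
# Hypothetical defaults; override from the environment.
NODES="${NODES:-node01 node02 node03}"
IFACE="${IFACE:-eth0}"

# Read "host mtu" pairs on stdin and print every host whose MTU differs
# from the expected value in $1.  Exit status is 1 if any host differs.
report_mismatches() {
    awk -v want="$1" '
        $2 != want { printf "%s: MTU %s (expected %s)\n", $1, $2, want; bad = 1 }
        END { exit bad + 0 }
    '
}

# Print one "host mtu" line per node; a node that cannot be reached
# (or has no such interface) is reported as "unreachable".
collect_mtus() {
    for n in $NODES; do
        mtu=$(ssh -o BatchMode=yes "$n" cat "/sys/class/net/$IFACE/mtu" 2>/dev/null) \
            || mtu="unreachable"
        echo "$n $mtu"
    done
}
```

With the functions loaded, something like

	collect_mtus | report_mismatches 4500

should print only the machines that did not pick up the new MTU, and
print nothing if they all agree.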

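As a complement to the tracepath suggestion in the quoted text, a rough
way to see whether full-size frames get through unfragmented is a ping
with the don't-fragment bit set.  This is only a sketch: it assumes
IPv4 without header options and the iputils ping found on most Linux
distributions, and the function names are made up for illustration.
The payload is the MTU minus 20 bytes of IPv4 header and 8 bytes of
ICMP header, so MTU=4500 gives 4472.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one IPv4 packet of the given
# MTU: MTU - 20 (IPv4 header, no options) - 8 (ICMP header).
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Send one echo request that exactly fills the MTU, with fragmentation
# prohibited (-M do).  Returns 0 only if a reply comes back.
# usage: probe_mtu HOST MTU
probe_mtu() {
    ping -M do -c 1 -W 2 -s "$(icmp_payload_for_mtu "$2")" "$1" >/dev/null 2>&1
}
```

A failure from "probe_mtu somenode 4500" alongside a success at 1500
would point at something in the path (NIC, driver or switch) that is
not passing the larger frames, though a single lost packet can also
produce a failure, so it is worth repeating.
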
-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843