[vortex] transmit timed out, tx_status 00 status e000 - Fixed

Thu Jan 17 10:55:00 2002

Ok,

I tried changing TX_TIMEOUT to 6*HZ. It did not make any difference
that I could notice in the transmitter stopping.
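
For reference, the TX_TIMEOUT values in this thread are given in timer
ticks, so what they mean in wall-clock time depends on HZ. Below is a
minimal sketch of the conversion, assuming HZ is 100 (the usual value
on x86 builds of 2.2 kernels; check your own). ASSUMED_HZ and
ticks_to_ms are names made up for this example, not taken from the
driver.

```c
/* Illustration only: plain arithmetic, not driver code. */

/* ASSUMPTION: timer ticks per second. 100 is typical for x86 2.2 kernels. */
enum { ASSUMED_HZ = 100 };

/* Hypothetical helper: convert a timeout in ticks to milliseconds. */
static long ticks_to_ms(long ticks)
{
    return ticks * 1000L / ASSUMED_HZ;
}

/* The TX_TIMEOUT settings mentioned in this thread, in ticks. */
static const long timeout_2_2_20 = (400 * ASSUMED_HZ) / 1000; /* (400*HZ)/1000 */
static const long timeout_0_99U  = 2 * ASSUMED_HZ;            /* 2*HZ */
static const long timeout_tried  = 6 * ASSUMED_HZ;            /* 6*HZ */
```

Under that assumption the three settings come to 400 ms, 2 seconds and
6 seconds respectively.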

I also tried changing TX_RING_SIZE to 64 (from 16) and
TX_QUEUE_LENGTH to 40 (from 10). I basically just multiplied
everything by 4.
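
To make the two settings concrete: in ring-based drivers of this
style, the ring size is the number of transmit descriptors, and the
queue length caps how many packets may be outstanding before the
driver stops accepting more. The sketch below is a simplified
userspace model of that bookkeeping, not the 3c59x source. The struct,
the function names and the cur_tx/dirty_tx counters are assumptions
chosen for illustration.

```c
/* Simplified model of a transmit ring. NOT the 3c59x source. */
struct tx_ring_model {
    unsigned int ring_size;   /* descriptors in the ring (cf. TX_RING_SIZE) */
    unsigned int queue_limit; /* max packets outstanding (cf. TX_QUEUE_LENGTH) */
    unsigned int cur_tx;      /* count of packets handed to the NIC */
    unsigned int dirty_tx;    /* count of packets the NIC has finished */
};

/* Packets queued but not yet completed. Unsigned subtraction stays
 * correct even after the counters wrap around. */
static unsigned int tx_in_flight(const struct tx_ring_model *r)
{
    return r->cur_tx - r->dirty_tx;
}

/* Nonzero when the model would refuse further packets. */
static int tx_queue_full(const struct tx_ring_model *r)
{
    return tx_in_flight(r) >= r->queue_limit;
}

/* Queue one packet. Returns the ring slot used, or -1 if the queue is full. */
static int tx_enqueue(struct tx_ring_model *r)
{
    if (tx_queue_full(r))
        return -1;
    return (int)(r->cur_tx++ % r->ring_size);
}

/* Mark the oldest outstanding packet as completed, if there is one. */
static void tx_complete(struct tx_ring_model *r)
{
    if (r->dirty_tx != r->cur_tx)
        r->dirty_tx++;
}
```

In this model a burst stalls after 10 outstanding packets with the old
values (16/10) and after 40 with the new ones (64/40). That shows what
the settings control; it does not by itself explain the timeouts.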

This allowed my test to survive a 1,000,000 iteration run. Previously
the network had survived a 10,000 iteration run only 1 time out of 30,
so this looks very significant. An iteration is sending one
32,768-byte TCP packet through MPI to another machine. Each test runs
all N iterations 3 times with different transmitter/receiver pairs. I
am running these tests with 4 nodes.
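
A quick back-of-the-envelope on the size of this test, as a sketch
only. It assumes the 32,768 figure is a byte count, that each of the 3
passes sends all N messages once, and that a message is split into
segments of at most 1460 bytes (a common MSS on plain Ethernet, but an
assumption here). The function names are made up for this example.

```c
/* Back-of-the-envelope arithmetic for the test described above. */

/* Total payload bytes sent over all passes of one test. */
static unsigned long long total_payload_bytes(unsigned long long iterations,
                                              unsigned long long msg_bytes,
                                              unsigned long long passes)
{
    return iterations * msg_bytes * passes;
}

/* Number of TCP segments needed for one message, rounding up.
 * mss is an ASSUMPTION supplied by the caller. */
static unsigned long long segments_per_message(unsigned long long msg_bytes,
                                               unsigned long long mss)
{
    return (msg_bytes + mss - 1) / mss;
}
```

With those assumptions, total_payload_bytes(1000000, 32768, 3) is
98,304,000,000 bytes for the long test against 983,040,000 for the
10,000 iteration test, and segments_per_message(32768, 1460) is 23.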

I then changed the 6*HZ back to 2*HZ to see which set of changes made
the difference, and the second set of changes (ring size and queue
length) appears to be what fixed the problem. This may give the
experts a hint as to where the bug is, or what is actually going on to
cause the problem. It may be that less extreme changes in the ring
size and queue length would also result in correct behavior.

I am doing further testing to make sure that this fix actually works.
I will probably be running 16+ machines on a larger test to satisfy
myself that the setting changes have eliminated the problem.

Right now we aren't using any bonded ethernet or priority packet
features on any of the machines with the 3com driver, so that is not
an issue.

This very much looks like a software bug, either in the 2.2.19 kernel
or in the 3com driver. I doubt that the changes I made would work
around any real hardware bug.

				Roger

> -----Original Message-----
> From:	Dylan Thomas [SMTP:thomasd@post.queensu.ca]
> Sent:	1/16/2002 3:21 PM
> To:	vortex@scyld.com
> Subject:	[vortex] transmit timed out, tx_status 00 status e000
> 
> 
> Hello everyone, I'll re-add my two cents worth on this subject.
> On Wed October 31st I posted this problem to the vortex-bug newsgroup,
> but since that time only one person has responded (not to the
> newsgroup, directly to me).
> 
> Here is a link to my original post.
> 
> http://www.scyld.com/pipermail/vortex-bug/2001-October/001017.html
> 
> The person who responded to my post mentioned the following...
> 
> --------
> 
> I also have the same problem, but only when connected at 100Mbit/Full
> duplex. When I connect to an old hub at 10 Mbit I have no problem at
> all. I browsed the code to see when the error message is generated
> and discovered that:
> 
> in the recent 2.2.20 kernel TX_TIMEOUT is set to (400*HZ)/1000,
> whereas in the 0.99U version it is set to 2*HZ
> 
> Just out of curiosity I tried changing this value to 4*HZ, and no
> more transmit timeouts occurred. (I do a tar -zcvf of the whole disk
> over the network to test.)
> 
> Just did another test run (under heavy load; eg setathome :) ) and
> got tx_timeouts again. Increasing TX_TIMEOUT to 6*HZ ... and no more
> errors. Something is definitely wrong here with the timings.
> 
> Maybe this helps you.
> 
> Gerhard
> 
> ------
> 
> I have NOT had a chance to verify this, as I am still using the old
> kernel (2.2.16) and driver on my cluster. Perhaps this message will
> help.
> 
> Sincerely
> 
> -Dylan
> 
> _______________________________________________
> vortex mailing list
> vortex@scyld.com
> http://www.scyld.com/mailman/listinfo/vortex