[vortex-bug] More on "eth1: transmit timed out, tx_status 00 status e000."

Dylan Thomas thomasd@post.queensu.ca
Wed Oct 31 15:26:02 2001


Hello everyone.

I too have had the "eth1: transmit timed out, tx_status 00 status e000."
problem with the 3com drivers.  I don't know if this has been resolved
yet; I just subscribed to the newsgroup.

I have an 8-node cluster of single-CPU PIIIs that has been running
rock-solid for the last year, each node using one 3com 3c905B card and
the 3c59x kernel module that came with RedHat 6.2, kernel 2.2.16-3
(3c59x.c:v0.99H).

I figured I'd try the channel bonding thing, and was able to get double
the throughput with a simple test case of two nodes, 2 cross-over cables
and 2 3c905B cards in each box.  Impressed, I ordered 8 more network cards
and a separate switch.

When I got around to re-building the cluster with the channel bonding on,
I decided to use the latest kernel module from the scyld site
(3c59x.c:v0.99U), which is when I started getting the "kernel: eth1:
transmit timed out, tx_status 00 status e000." errors.

I have since downgraded the kernel module back to version 0.99H, and I am
not having any problems (as of yet).

So, I will have to concur with Aaron's previous message (on Mon 10 Sep
2001) that the problem lies within the newer 3c59x driver.  (Feel free to
nicely point out to me if I'm wrong.)

I've tried the newer kernel module (3c59x.c:v0.99U) with a single
dual-port 3com 3c982 card, and I got the same "kernel: eth1:
transmit timed out, tx_status 00 status e000." errors.  Obviously the
older driver does not work with this card, so I was unable to check that.

I can consistently reproduce the transmit timed out error by
running a program that stresses the network connection by measuring the
message passing throughput using LAM (http://www.lam-mpi.org/) as the MPI
library.  I'm sure this is not the only way to reproduce the errors;
Andre Holzner mentioned that his occurred when transferring large files.
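
For anyone without LAM installed, here is a rough stand-in for that kind
of load.  It is not my LAM program and does not use MPI; it is only a
plain TCP sketch that pushes a fixed number of bytes and reports
bytes/sec.  The function names, the chunk size and the 8-byte
acknowledgement are all made up for illustration, and I have not
confirmed that this particular load triggers the timeout.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send/recv call (arbitrary choice)


def receive_all(server, total_bytes, result):
    """Accept one connection, read total_bytes, then reply with the count."""
    conn, _ = server.accept()
    with conn:
        received = 0
        while received < total_bytes:
            data = conn.recv(CHUNK)
            if not data:
                break
            received += len(data)
        # Tell the sender how much actually arrived (8-byte big-endian count).
        conn.sendall(received.to_bytes(8, "big"))
    result.append(received)


def send_all(host, port, total_bytes):
    """Send total_bytes to host:port; return (bytes acknowledged, bytes/sec)."""
    payload = b"\0" * CHUNK
    start = time.time()
    with socket.create_connection((host, port)) as conn:
        sent = 0
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            conn.sendall(payload[:n])
            sent += n
        ack = b""
        while len(ack) < 8:
            part = conn.recv(8 - len(ack))
            if not part:
                break
            ack += part
    elapsed = max(time.time() - start, 1e-9)
    acked = int.from_bytes(ack, "big")
    return acked, acked / elapsed


def loopback_demo(total_bytes=4 * 1024 * 1024):
    """Run receiver and sender in one process over 127.0.0.1."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = []
    worker = threading.Thread(target=receive_all,
                              args=(server, total_bytes, result))
    worker.start()
    acked, rate = send_all("127.0.0.1", port, total_bytes)
    worker.join()
    server.close()
    return acked, rate
```

loopback_demo() only checks that the script itself works, since traffic
on 127.0.0.1 never touches the network card.  To load a real interface,
receive_all() would have to run on one node and send_all() on another,
pointed at the address of bond0.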

I'm also pretty sure that I've set up channel bonding correctly (as it
works with the older 3c59x driver).

Below are some snippets from the syslog of the computer that hangs.

If anyone needs any specific info let me know.

Sincerely
-Dylan

<snip>

 kernel: 3c59x.c:v0.99U 7/30/2001 Donald Becker, becker@scyld.com
 kernel:   http://www.scyld.com/network/vortex.html
 kernel: eth0: 3Com 3c905 Boomerang 100baseTx at 0xd000, 00:60:08:ab:13:89, IRQ 5
 kernel:   8K buffer 3:5 Rx:Tx split, autoselect/MII interface.
 kernel:   MII transceiver found at address 24, status 7849.
 kernel:   Using bus-master transmits and whole-frame receives.
 kernel: eth1: 3Com 3c905B Cyclone 100baseTx at 0xb800, 00:01:02:c8:78:a7, IRQ 10
 kernel:   8K buffer 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
 kernel:   MII transceiver found at address 24, status 7849.
 kernel:   MII transceiver found at address 0, status 7849.
 kernel:   Using bus-master transmits and whole-frame receives.

<snip>

network: Setting network parameters succeeded
ifup: SIOCADDRT: Network is unreachable
network: Bringing up interface lo succeeded
network: Bringing up interface bond0 succeeded
ifup: Enslaving eth0 to bond0
ifup: master has no hw address assigned; getting one from slave!
network: Bringing up interface eth0 succeeded
ifup: Enslaving eth1 to bond0
network: Bringing up interface eth1 succeeded

<snip>

kernel: eth0: Setting full-duplex based on MII #24 link partner capability of 45e1.
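
In case the 45e1 value is useful to anyone: assuming it is the standard
MII link partner ability word (register 5 in IEEE 802.3
auto-negotiation), it can be decoded with a small table.  This is only
a sketch; the function name and the wording of the labels are mine, and
it does not reproduce whatever logic the driver itself uses to choose
the duplex setting.

```python
# Bit masks for the auto-negotiation link partner ability word
# (IEEE 802.3 base page, selector field in the low 5 bits).
LPA_BITS = {
    0x0020: "10baseT half-duplex",
    0x0040: "10baseT full-duplex",
    0x0080: "100baseTx half-duplex",
    0x0100: "100baseTx full-duplex",
    0x0200: "100baseT4",
    0x0400: "pause",
    0x2000: "remote fault",
    0x4000: "acknowledge",
    0x8000: "next page",
}


def decode_link_partner(value):
    """Return the names of the ability bits set in value, lowest bit first."""
    return [name for mask, name in sorted(LPA_BITS.items()) if value & mask]


if __name__ == "__main__":
    for ability in decode_link_partner(0x45E1):
        print(ability)
```

Decoded this way, 45e1 includes "100baseTx full-duplex", which is
consistent with the driver switching eth0 to full duplex.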

<snip>

At this point I start my LAM program, which loads the network.  After less
than 60 seconds the program hangs, waiting for data, and the syslog
reports:

<snip>

kernel: eth1: transmit timed out, tx_status 00 status e000.
last message repeated 7 times
last message repeated 13 times
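
Since syslogd collapses identical consecutive messages into "last
message repeated N times" lines, the three lines above stand for
1 + 7 + 13 = 21 timeouts.  Here is a small sketch for tallying them from
a longer log.  The function name and the regular expressions are my own
and are only matched against the message format shown here.

```python
import re

TIMEOUT_RE = re.compile(r"(eth\d+): transmit timed out")
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")


def count_timeouts(lines):
    """Count transmit timeouts per interface, expanding syslog repeat lines."""
    counts = {}
    last_iface = None  # interface named by the most recent timeout message
    for line in lines:
        timeout = TIMEOUT_RE.search(line)
        if timeout:
            last_iface = timeout.group(1)
            counts[last_iface] = counts.get(last_iface, 0) + 1
            continue
        repeat = REPEAT_RE.search(line)
        if repeat:
            # A repeat line only counts if it directly follows a timeout
            # (or another repeat line belonging to that timeout).
            if last_iface is not None:
                counts[last_iface] += int(repeat.group(1))
        else:
            # Any other message breaks the chain of repeats.
            last_iface = None
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: eth1: transmit timed out, tx_status 00 status e000.",
        "last message repeated 7 times",
        "last message repeated 13 times",
    ]
    print(count_timeouts(sample))
```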


Cheers again.

-Dylan