[tulip-bug] troublesome linksys NIC

Keith Warno keith.warno@valaran.com
Tue Dec 18 16:51:01 2001


Donald Becker wrote:

> On Tue, 18 Dec 2001, Keith Warno wrote:
>>>Did you see a message that reported
>>>    printk(KERN_WARNING "%s: Tx hung, %d vs. %d.\n",
>>>
>>Yes, I have seen this message before, as recently as today, but not with 
>>the current driver version (not yet, anyway), only with the previous 
>>driver version, "tulip.c:v0.92 4/17/2000".
>>
> 
> Hmmm, OK, that means that the timeout was triggered by a new transmit
> attempt, not in the timer-based monitoring code.
> 
> 
>>So we're missing a Tx done interrupt, eh?  Perhaps the card is not seated 
>>properly?
>>
> 
> No, this is entirely a software issue, not a hardware problem.
> 
> Likely the Tx interrupt mitigation, or some part of the transmit flow
> control, has a "hole" somewhere that allows scavenging a full Tx queue
> without marking it as empty.  I don't see this in the code...
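
To make the suspected "hole" concrete, here is a toy user-space model of a Tx descriptor ring. It is a minimal sketch under stated assumptions: the names `cur_tx`, `dirty_tx`, `tx_full`, `tx_queue()`, `tx_reclaim()` and the ring size of 16 are hypothetical, loosely following common Linux driver conventions, and this is not the tulip driver's actual code. It shows how a reclaim path that frees descriptors but never clears the "full" flag leaves the queue stopped even though the ring is empty, which would look like a hung transmitter from the outside.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ring size; not taken from tulip.c. */
#define TX_RING_SIZE 16

struct tx_ring {
    unsigned int cur_tx;   /* descriptors queued by the transmit path */
    unsigned int dirty_tx; /* descriptors reclaimed after Tx done */
    bool tx_full;          /* set when the queue is stopped for lack of space */
};

/* Queue one packet; returns false when the queue is stopped. */
static bool tx_queue(struct tx_ring *r)
{
    if (r->tx_full)
        return false;               /* queue stopped: caller must retry later */
    r->cur_tx++;
    if (r->cur_tx - r->dirty_tx >= TX_RING_SIZE)
        r->tx_full = true;          /* no free descriptor left: stop the queue */
    return true;
}

/* Reclaim `done` completed descriptors, as a Tx-done interrupt handler would.
 * With clear_full == false this models the suspected bug: the ring is
 * scavenged but never marked as having space, so transmits stall. */
static void tx_reclaim(struct tx_ring *r, unsigned int done, bool clear_full)
{
    assert(done <= r->cur_tx - r->dirty_tx);  /* cannot reclaim unsent slots */
    r->dirty_tx += done;
    if (clear_full && r->tx_full && r->cur_tx - r->dirty_tx < TX_RING_SIZE)
        r->tx_full = false;         /* space available again: restart the queue */
}
```

In the buggy case, filling the ring and then reclaiming every descriptor with `clear_full == false` leaves `cur_tx == dirty_tx` (an empty ring) while `tx_full` is still set, so every later `tx_queue()` call fails until some timeout path resets the ring.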

Crap.  OK, a little more detail.

We use a variety of Linksys LNE100 cards here, both version 2.0 and version 
4.1, although I can't tell you off the top of my head which one is in that box.

The kernel on the troubled box is 2.2.16-3, i.e., a Red Hat 
shrink-wrapped kernel (but it is not a RH box).  tulip.o and pci-scan.o 
were, of course, built separately.

The box is a mail server serving about 60 people.  It gets hammered all 
day, every day, and it is the only box with a Linksys NIC that has had 
transmit timeouts time and time again.  However, the "Tx hung" 
messages are a recent thing (within the past couple of weeks).

A couple of other heavily used and abused boxes have Linksys cards, 
Adaptec Quartet66 cards, or both, and they have not shown the symptoms 
the mail server has.

I'm stumped at the moment, especially if it really is a software-only 
problem.