[eepro100-bug] Possible Receiver transmitter bug

Donald Becker becker@scyld.com
Tue, 18 Jul 2000 22:29:51 -0400 (EDT)

On Fri, 14 Jul 2000, Joseph Varughese Modayil wrote:

> I have an Intel Pro/100+ ethernet card in a linux box and am having problems
> with it.   The kernel log says:
> Jul 14 10:54:21 vulcan kernel: eth0: Transmit timed out: status 7048  0000 at 334315/334329 command 000ca000.
> Jul 14 10:54:30 vulcan kernel: hde: lost interrupt
> I am hoping that the hard drive issue is caused by the network card.

Nope, but they are both caused by the same problem.
The interrupt line is blocked.
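
A rough way to confirm a blocked line is to sample /proc/interrupts twice
while traffic is flowing and see whether the count for the card's IRQ moves.
The sketch below is only an illustration: the helper name irq_count and its
optional file argument are made up here, not part of any existing tool.

```shell
#!/bin/sh
# irq_count IRQ [FILE]: print the interrupt count for IRQ, summed over all
# CPU columns, from a file in /proc/interrupts format.
# Hypothetical helper; FILE defaults to /proc/interrupts.
irq_count() {
    awk -v irq="$1" '
        # Data rows begin with the IRQ number followed by a colon.
        $1 == (irq ":") {
            total = 0
            # Per-CPU counts are the all-digit fields right after the label;
            # stop at the first non-numeric field (the controller name).
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                total += $i
            print total
            found = 1
        }
        # Fail if the IRQ has no row at all.
        END { if (!found) exit 1 }
    ' "${2:-/proc/interrupts}"
}
```

For example, run "irq_count 18; sleep 5; irq_count 18" while pinging the
machine.  If both numbers are the same, no interrupts are being delivered.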

> On boot up I get the message:
> Jul 14 11:19:04 vulcan kernel: eth0: Intel EtherExpress Pro 10/100 at 0xd400, 00:D0:B7:21:14:9B, IRQ 18.

I'm guessing, but I'm probably right: Damn APIC code again.  An IRQ number
above 15 means the IO-APIC is routing your interrupts.

We run our SMP clusters with the "noapic" setting, because without it they
eventually topple with the same bug.  Interrupts get blocked, usually for
the network adapter.

# echo 'append="noapic"' >> /etc/lilo.conf
# /sbin/lilo
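
After rebooting, it is worth checking that the option actually reached the
kernel.  A minimal sketch, assuming the kernel command line is readable from
/proc/cmdline; the helper name has_boot_flag is hypothetical:

```shell
#!/bin/sh
# has_boot_flag FLAG [FILE]: succeed if FLAG appears as a whole
# whitespace-separated word on the kernel command line.
# Hypothetical helper; FILE defaults to /proc/cmdline.
has_boot_flag() {
    # Put each word on its own line, then require an exact whole-line match
    # so that "noapic" does not match "apic" or vice versa.
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage: "has_boot_flag noapic && echo 'booted with noapic'".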

The new v1.10 driver falls back to polling mode when interrupts are
blocked.  The performance is really bad in polling mode, but it is enough to
log error messages so that you can swear more effectively.
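
To see how often the problem strikes, the timeouts in the system log can be
tallied per interface.  This is only a sketch: it assumes the syslog line
layout quoted above and a log file at /var/log/messages, and the helper name
count_tx_timeouts is made up for illustration.

```shell
#!/bin/sh
# count_tx_timeouts [FILE]: print "<interface> <count>" for every interface
# that logged a "Transmit timed out" kernel message.
# Hypothetical helper; FILE defaults to /var/log/messages (an assumption).
count_tx_timeouts() {
    awk '
        /Transmit timed out/ {
            # The interface name is the field right after "kernel:".
            for (i = 1; i < NF; i++)
                if ($i == "kernel:") {
                    dev = $(i + 1)
                    sub(/:$/, "", dev)   # strip the trailing colon
                    count[dev]++
                    break
                }
        }
        END { for (dev in count) print dev, count[dev] }
    ' "${1:-/var/log/messages}" | sort
}
```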

> The code in eepro.c has two lines which mention a bug lock-up and I
> was wondering if they are correct, because I do not get the second
> message.

This is unrelated.  The eepro100 design has a race condition with 10Mbps
receives that can latch up the receive unit.  The condition is *very* rare,
and has probably been fixed.  It reportedly happens only at 10Mbps, and the
driver recovers in less than two seconds.

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Beowulf Clusters / Linux Installations
Annapolis MD 21403