Receiver lock-up workaround not enabled properly?

Jason M. Felice jasonf@Baldwingroup.COM
Mon Aug 23 15:30:01 1999


Problem Summary:
I have a total of eighteen Intel boxes with onboard eepro100 boards.  Two
or three of them have an additional PCI eepro100.  The ones with two NICs all
log the same messages at boot time regarding the eepro100 boards.  One
of those machines, which is a pretty heavily hit box (about 16 CIPE nodes
generating traffic 8-10 hours per day), has suffered three unrecoverable
network outages to date, all of them with the same symptoms:

1) The box will not respond to traffic on either interface.
2) An `fping' job that runs every two hours, and other programs as well,
   report ENOBUFS.  I'm assuming this is an effect and not the cause.
3) I get plenty of these:
   Aug 23 07:26:02 nccr-cle-c kernel: dst cache overflow
4) The only resolution so far has been to reboot the machine.

Interesting Notes:

kernel 2.2.7, eepro100.c v1.06 built as module, hacked RH5.2, using CIPE 1.2.0

Here is a snippet from the logs of what messages the driver emits during boot:
Aug 23 07:56:15 nccr-cle-c kernel: eth0: Intel EtherExpress Pro 10/100 at 0xef00, 00:90:27:3E:33:EA, IRQ 10.
Aug 23 07:56:15 nccr-cle-c kernel:   Board assembly 000000-000, Physical connectors present: RJ45
Aug 23 07:56:15 nccr-cle-c kernel:   Primary interface chip i82555 PHY #1.
Aug 23 07:56:15 nccr-cle-c kernel:   General self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   Serial sub-system self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   Internal registers self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   ROM checksum self-test: passed (0x04f4518b).
Aug 23 07:56:15 nccr-cle-c kernel:   Receiver lock-up workaround activated.
Aug 23 07:56:15 nccr-cle-c kernel: eth1: Intel EtherExpress Pro 10/100 at 0xed80, 00:90:27:54:9E:30, IRQ 11.
Aug 23 07:56:15 nccr-cle-c kernel:   Receiver lock-up bug exists -- enabling work-around.
Aug 23 07:56:15 nccr-cle-c kernel:   Board assembly 721383-006, Physical connectors present: RJ45
Aug 23 07:56:15 nccr-cle-c kernel:   Primary interface chip i82555 PHY #1.
Aug 23 07:56:15 nccr-cle-c kernel:   General self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   Serial sub-system self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   Internal registers self-test: passed.
Aug 23 07:56:15 nccr-cle-c kernel:   ROM checksum self-test: passed (0x04f4518b).
Aug 23 07:56:15 nccr-cle-c kernel: iplogd uses obsolete (PF_INET,SOCK_PACKET)

The interesting part is that the two boards log different messages for the
receiver lock-up work-around.  This one appears on the first board:
Aug 23 07:56:15 nccr-cle-c kernel:   Receiver lock-up workaround activated.
and this one on the second board:
Aug 23 07:56:15 nccr-cle-c kernel:   Receiver lock-up bug exists -- enabling work-around.

Checking eepro100.c, the driver isn't really enabling the receiver work-around
on the second board.  As a matter of fact, the second message is emitted when
((eeprom[3] & 0x03) != 0), and the first message is emitted when the lock-up
work-around is actually enabled, which is when ((eeprom[3] & 0x03) != 3).
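
To make the mismatch concrete, here is a minimal sketch of the two tests as
described above.  It is not the driver code itself: the function names and the
RX_LOCKUP_MASK macro are invented for illustration, and only the two
comparisons come from my reading of eepro100.c.  The inner parentheses matter
in C, because != binds tighter than &, so "x & 0x03 != 0" would parse as
"x & (0x03 != 0)".

```c
#include <stdbool.h>
#include <stdio.h>

/* Low two bits of EEPROM word 3 (mask name is made up for this sketch). */
#define RX_LOCKUP_MASK 0x03

/* When "Receiver lock-up bug exists -- enabling work-around." is printed. */
bool prints_bug_exists(unsigned short word3)
{
    return (word3 & RX_LOCKUP_MASK) != 0;
}

/* When the work-around is really switched on, which is also when
 * "Receiver lock-up workaround activated." is printed. */
bool workaround_enabled(unsigned short word3)
{
    return (word3 & RX_LOCKUP_MASK) != 3;
}

/* Print both outcomes for each of the four possible values of the bits. */
void print_lockup_table(void)
{
    for (unsigned bits = 0; bits <= 3; bits++)
        printf("bits=%u  bug-exists message=%d  work-around on=%d\n",
               bits,
               prints_bug_exists((unsigned short)bits),
               workaround_enabled((unsigned short)bits));
}
```

If these conditions are right, a value of 0 gives only the "activated" message
(what eth0 logs) and a value of 3 gives only the "bug exists" message with the
work-around left off (what eth1 logs).  Values 1 and 2 would produce both
messages.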

I don't have access to specs, so I can't determine whether the second message
is inaccurate or the test that enables the RX lock-up work-around is wrong.  I
hope there is enough information here for someone who has access to specs to
be able to fix (something) easily :)

-Jason M. Felice

P.S. For the time being, I'm going to force the receiver lock-up workaround
for that machine on both cards, as it doesn't look like it can harm anything
(other than a bit of performance).
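
Forcing the work-around amounts to bypassing the EEPROM test.  A rough sketch
of the idea, assuming a hypothetical force flag; the flag and the function
name are invented here and are not taken from eepro100.c:

```c
#include <stdbool.h>

/* Decide whether the RX lock-up work-around should be active.
 * "force" is a hypothetical override, not a real eepro100 option. */
bool rx_workaround_active(unsigned short word3, bool force)
{
    if (force)
        return true;              /* ignore what the EEPROM says */
    return (word3 & 0x03) != 3;   /* the test described earlier in this post */
}
```

With force set, a board whose low two bits read 3 (like eth1 above) gets the
work-around anyway.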