PNIC chip botch/hang under heavy load...

Dave Platt dplatt@radagast.org
Wed Apr 21 13:45:08 1999


Background: we have a whole bunch of Linux systems running Red Hat 5.1
and 5.2.  At my recommendation, almost all of them have been equipped
with NetGear FA310TX cards... a mix of the older version (based on the
DEC 21140) and the newer (based on the PNIC).

Several of these systems are suffering from Ethernet lockups... typically
under high-load conditions, when transferring large files (hundreds of
megabytes) via FTP.  The symptoms of the lockup are that the Tulip
driver in the kernel reports a burst of "Oversized Ethernet frame
spanned multiple buffers" messages with what appear to be some garbled
packet-status flags.  The Ethernet board then locks up, failing to
receive any further packets.  The only cure we know of is to
restart the network stack (taking the interface down and then reconfiguring
it) - this resets the chip and permits subsequent operations to occur.
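
Until there is a real fix, the manual workaround described above could be automated. The following is only a minimal sketch in Python, not something that has been run on the affected machines. The function names, the threshold of two messages, and the use of Red Hat's /sbin/ifdown and /sbin/ifup scripts to bounce the interface are all assumptions made for illustration.

```python
import re
import subprocess

# Matches the kernel message quoted above and captures the interface name.
WEDGE_PATTERN = re.compile(
    r"kernel: (?P<iface>eth\d+): "
    r"Oversized Ethernet frame spanned multiple buffers"
)


def wedged_interfaces(log_lines, threshold=2):
    """Return the interfaces that logged at least `threshold` wedge messages.

    The threshold is a guess: the lockups reported here always show the
    message in bursts of two or more.
    """
    counts = {}
    for line in log_lines:
        match = WEDGE_PATTERN.search(line)
        if match:
            iface = match.group("iface")
            counts[iface] = counts.get(iface, 0) + 1
    return sorted(iface for iface, n in counts.items() if n >= threshold)


def bounce(iface, run=subprocess.call):
    """Take the interface down and bring it back up.

    ASSUMPTION: /sbin/ifdown and /sbin/ifup exist and reconfigure the
    interface on their own. `run` is injectable so the function can be
    exercised without touching a real interface.
    """
    run(["/sbin/ifdown", iface])
    run(["/sbin/ifup", iface])
```

A cron job could feed recent lines from the system log to wedged_interfaces() and call bounce() on each result. This only papers over the lockup; connections in flight at the time will most likely still stall or drop.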

The problem has shown up only on the more recent FA310TX boards which use
the Netgear-labeled PNIC chip.  Older revisions of the FA310TX which used
one of the Digital 21140 chips have never exhibited this problem, nor have
cards from other vendors which also use the Digital chip.  The affected
PNIC-based FA310TX boards seem to operate fine under light load, but
when stressed (e.g. via an FTP "get" operation on a 100BaseTX network)
will usually lock up within a minute or so.

We've observed the problem with at least three different versions of the
Tulip driver - the 0.89 driver shipped with Red Hat 5.2, the Netgear-
modified 0.89k driver on the Netgear Web site, and the 0.90z development
driver on Don Becker's Web site.

Here's a typical set of log entries:

Apr 17 04:02:56 jsmith kernel: Found NETGEAR NGMC169 MAC at PCI I/O address 0xb000.
Apr 17 04:02:56 jsmith kernel: tulip.c:v0.89K 8/8/98 Originally written by becker@cesdis.gsfc.nasa.gov
Apr 17 04:02:56 jsmith kernel: Driver modified by Netgear for FA310TX
Apr 17 04:02:56 jsmith kernel: Netgear technical support: support@netgear.matrixx.net -- big hammer from tivo
Apr 17 04:02:56 jsmith kernel: eth0: NETGEAR NGMC169 MAC at 0xb000, 00 a0 cc 3f a0 82, IRQ 10.
Apr 17 04:02:56 jsmith kernel: eth0: Checking for MII transceivers...
Apr 17 04:02:56 jsmith kernel: eth0:  MII transceiver found at MDIO address 1, config 1000 status 782d.

Apr 17 04:02:59 jsmith kernel: eth0: The transmitter stopped!  CSR5 is 2678016, CSR6 812e2002.
Apr 17 04:02:59 jsmith kernel: eth0: Changing NGMC169 configuration to half-duplex, CSR6 812e0000.

Apr 17 04:06:43 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 7fff0200!
Apr 17 04:06:43 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 06688186!
Apr 17 04:06:50 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 7fff0200!
Apr 17 04:06:50 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 08018192!

At this point, the card is wedged.

A later set of tests with a different driver:

Apr 17 04:49:18 jsmith kernel: Found Lite-On 82c168 PNIC at PCI I/O address 0xb000.
Apr 17 04:49:18 jsmith kernel: tulip.c:v0.90z 4/7/99 becker@cesdis.gsfc.nasa.gov
Apr 17 04:49:18 jsmith kernel: eth0: Lite-On 82c168 PNIC rev 33 at 0xb000, 00:A0:CC:3F:A0:82, IRQ 10.
Apr 17 04:49:18 jsmith kernel: eth0:  MII transceiver #1 config 1000 status 782d advertising 01e1.

Apr 17 04:49:22 jsmith kernel: eth0: The transmitter stopped.  CSR5 is 2678016, CSR6 810e2002, new CSR6 810e0000.
Apr 17 04:49:22 jsmith kernel: eth0: Changing PNIC configuration to half-duplex, CSR6 810e0000.

Apr 17 04:51:14 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 7fff0200!
Apr 17 04:51:14 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 06318186!
Apr 17 04:51:23 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 7fff0200!
Apr 17 04:51:23 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 06738182!
Apr 17 04:51:33 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 7fff0200!
Apr 17 04:51:33 jsmith kernel: eth0: Oversized Ethernet frame spanned multiple buffers, status 08018192!
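
To help read the status words in the log entries above, here is a rough decoding sketch in Python. The bit positions are an assumption: they follow the receive-descriptor status layout of the DEC 21140 as best recalled, and the PNIC may well differ, so treat the output as a hint rather than a diagnosis. The function name and the flag table are hypothetical.

```python
# ASSUMPTION: bit layout borrowed from the DEC 21140 receive descriptor
# status word; not verified against any PNIC documentation.
FLAGS = {
    15: "error summary",
    9: "first descriptor",
    8: "last descriptor",
    7: "frame too long",
    4: "receive watchdog",
    2: "dribbling bit",
    1: "CRC error",
}


def decode_rx_status(status):
    """Split a status word into (length field, list of flag names)."""
    # Assumed: the frame length occupies bits 29..16 (14 bits).
    length = (status >> 16) & 0x3FFF
    flags = [
        name
        for bit, name in sorted(FLAGS.items(), reverse=True)
        if status & (1 << bit)
    ]
    return length, flags


# Status words taken from the log entries in this message.
for word in (0x7FFF0200, 0x06688186, 0x08018192):
    length, flags = decode_rx_status(word)
    print("%08x: length field %d, flags: %s" % (word, length, ", ".join(flags)))
```

Under that assumed layout, each 7fff0200 word carries only the first-descriptor bit, and the word that follows it carries last-descriptor and frame-too-long with a length field above the normal Ethernet maximum. That would be consistent with the driver's "spanned multiple buffers" complaint, but it does not say why the chip produced such a frame.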

Anybody else seen this sort of behavior?  Any clues, or hints for a fix or for a
patch to the driver which would reinitialize the chip after it goes haywire?

I've got a call in to Netgear and have sent them this information... no reply yet.