heavy load crashes network services

Donald Becker becker@cesdis1.gsfc.nasa.gov
Thu May 27 16:44:11 1999


On Thu, 27 May 1999, gil wrote:

> The eepro card running driver eepro100.c 0.99B resulted in a system
> crash with the following error:
> 
> May  1 09:05:45 reliant kernel: eth0: Transmit timed out: status 7048 command 0000.

This status means:
  Command/Tx unit suspended (transmits are done)
  Receive unit has no resources.
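
For anyone who wants to decode such a status word by hand, here is a
minimal sketch. The field layout (interrupt bits in 15:8, command unit state
in 7:6, receive unit state in 5:2) and the state values are assumptions
based on the i8255x SCB status word as I understand it; the function and
enum names are made up for illustration and are not taken from eepro100.c.

```c
#include <stdint.h>

/* Assumed SCB status word layout (not copied from the driver):
 *   bits 15:8  interrupt status/ack bits
 *   bits  7:6  command (Tx) unit state
 *   bits  5:2  receive unit state
 */
enum { CU_IDLE = 0, CU_SUSPENDED = 1 };
enum { RU_IDLE = 0, RU_SUSPENDED = 1, RU_NO_RESOURCES = 2, RU_READY = 4 };

/* Extract the command unit state field. */
unsigned scb_cu_state(uint16_t status) { return (status >> 6) & 0x3; }

/* Extract the receive unit state field. */
unsigned scb_ru_state(uint16_t status) { return (status >> 2) & 0xF; }

/* Extract the pending interrupt bits (upper byte). */
unsigned scb_irq_bits(uint16_t status) { return status >> 8; }
```

Under those assumptions, 0x7048 gives a command unit state of 1 (suspended)
and a receive unit state of 2 (no resources), which matches the reading
above.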

The 0.99B driver could get into a state where it used up all of the Rx
buffers while the kernel was temporarily short of memory.  Since the rx()
routine was not called again (after all, no more packets were being
received), the Rx buffers were never refilled.
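
The stall can be illustrated with a toy model. This is a simulation under
stated assumptions, not the eepro100.c code: the ring structure, the
memory_short flag standing in for a failing kernel allocation, and all
function names are hypothetical. The only point it makes is that when
refilling happens solely inside rx(), an empty ring can never recover.

```c
#include <stdbool.h>

#define RX_RING_MAX 64             /* upper bound for this toy model only */

struct rx_ring {
    bool has_buffer[RX_RING_MAX];  /* slot holds an empty buffer for the NIC */
    unsigned size;                 /* ring entries in use, <= RX_RING_MAX */
    unsigned cur;                  /* next slot the NIC will fill */
    unsigned dirty;                /* next slot the driver must refill */
};

/* Stand-in for the allocator: true while the kernel is short of memory. */
bool memory_short = false;

void ring_init(struct rx_ring *r, unsigned size) {
    r->size = size;
    r->cur = 0;
    r->dirty = 0;
    for (unsigned i = 0; i < size; i++)
        r->has_buffer[i] = true;
}

/* Driver rx(): runs only after a packet arrived, and is the only refill path. */
void rx(struct rx_ring *r) {
    while (r->dirty != r->cur) {
        if (memory_short)
            break;                 /* allocation failed, slot stays empty */
        r->has_buffer[r->dirty % r->size] = true;
        r->dirty++;
    }
}

/* NIC side: returns false when the receive unit has no resources. */
bool nic_receive(struct rx_ring *r) {
    unsigned slot = r->cur % r->size;
    if (!r->has_buffer[slot])
        return false;              /* packet dropped, rx() is never called */
    r->has_buffer[slot] = false;
    r->cur++;
    rx(r);                         /* receive interrupt invokes the driver */
    return true;
}
```

With a 4-entry ring, four packets arriving while memory_short is set empty
the ring, and every later nic_receive() fails even after memory_short is
cleared.  With an 8-entry ring the same burst leaves four buffers, so the
next packet triggers rx() and the ring refills.  That is why a larger ring
works around the problem without removing it.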

The work-around was to increase the size of the Rx ring.

This problem should not occur with the recent drivers.

BTW, avoid using the driver distributed with the 2.2.* kernels on SMP
machines.  It's actually a modified driver, where one modification causes
the descriptors to straddle cache lines.  This causes bad PCI bus
performance and, far worse, consistency race conditions with the write
buffers.
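
To make "straddle" concrete, here is a small sketch of the check.  The
32-byte line size and the helper name are assumptions chosen for
illustration; substitute the line size of the actual CPU and the real
descriptor size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* assumed cache line size in bytes */

/* True if the object [addr, addr + len) touches more than one cache line. */
bool straddles_cache_line(uintptr_t addr, size_t len) {
    if (len == 0)
        return false;
    return (addr / CACHE_LINE) != ((addr + len - 1) / CACHE_LINE);
}
```

For example, a hypothetical 16-byte descriptor at offset 16 sits inside one
line, while the same descriptor at offset 24 spans two lines, so a single
descriptor update touches both.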

> Then when, I tried the 3COM 3c905B card with the 3c59x.c  0.99H-WOL
> driver, I received this error which was repeated every few seconds:
> 
> May 26 19:43:55 reliant kernel: eth0: transmit timed out, tx_status 00 status e601.
> May 26 19:43:55 reliant kernel: eth0: Interrupt posted but not delivered -- IRQ blocked by another device?

Here the problem is that the interrupts are not being handled.  That's
usually a physical interrupt conflict.  The driver will limp along, passing
just one ring's worth of packets during each timer tick.  This "limp-home"
mode is only intended to work well enough to telnet in and fix the root
problem.
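
The limp-home idea can be sketched roughly as follows.  This is a toy
model, not the 3c59x.c code: the struct, the function names, and the
assumption that the low bit of the status word means "interrupt latched"
are all mine.

```c
#include <stdbool.h>
#include <stdint.h>

#define INT_LATCH 0x0001u   /* assumed: low status bit = interrupt posted */

struct nic {
    uint16_t status;        /* last value read from the status register */
    unsigned ring_pending;  /* packets waiting in the rings */
    unsigned handled;       /* packets processed so far */
};

/* Normal interrupt handler: drain the rings and clear the latch. */
void nic_interrupt(struct nic *n) {
    n->handled += n->ring_pending;
    n->ring_pending = 0;
    n->status = (uint16_t)(n->status & ~INT_LATCH);
}

/* Timer-driven watchdog: if the card says an interrupt is posted but the
 * handler never ran, call the handler by hand.  Returns true if it had to. */
bool tx_timeout(struct nic *n) {
    if (n->status & INT_LATCH) {
        nic_interrupt(n);   /* at most one ring's worth per timer tick */
        return true;
    }
    return false;
}
```

Under that assumption, the reported status e601 has the latch bit set, so
each watchdog tick would drain whatever is in the rings and then wait for
the next tick.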


Donald Becker					  becker@cesdis.gsfc.nasa.gov
USRA-CESDIS, Center of Excellence in Space Data and Information Sciences.
Code 930.5, Goddard Space Flight Center,  Greenbelt, MD.  20771
301-286-0882	     http://cesdis.gsfc.nasa.gov/people/becker/whoiam.html