[tulip-bug] tulip.c:v0.92 Rx suspended problem

Wolfgang Walter wolfgang.walter@stusta.mhn.de
Tue, 22 Aug 2000 00:56:30 +0200


On Mon, Aug 21, 2000 at 10:26:24AM -0400, Josip Loncaric wrote:
> On a few of our (supposedly identical) systems we tend to lose network
> connectivity.  I wrote a 'heartbeat' script to detect this, then do
> 'tulip-diag' and restart the interface (including removing/reloading the
> tulip driver).  Here is a fragment from the script log:
> 
> Sat Aug 19 07:15:17 EDT 2000 : heartbeat : n015 lost connectivity to fs1
> tulip-diag.c:v2.00 4/19/2000 Donald Becker (becker@scyld.com)
>  http://www.scyld.com/diag/index.html
> Index #1: Found a Lite-On 82c168 PNIC adapter at 0xd000.
>  Port selection is MII, full-duplex.
>  Transmit started, Receive started, full-duplex.
>   The Rx process state is 'Suspended -- no Rx buffers'.
>   The Tx process state is 'Idle'.
>   The transmit threshold is 128.
>  Use '-a' or '-aa' to show device registers,
>      '-e' to show EEPROM contents, -ee for parsed contents,
>   or '-m' or '-mm' to show MII management registers.
> Sat Aug 19 07:16:26 EDT 2000 : heartbeat : n015 connectivity restored to
> fs1
> 
> The above may be related to the condition reported in /var/log/messages
> a couple of minutes earlier:
> 
> Aug 19 07:13:22 n015 kernel: eth0: Restarted Rx at 2632898 / 2632898.
> 
> FYI, the above is observed on a system running Red Hat 6.2 kernel
> 2.2.16-3 updated to tulip.c:v0.92 4/17/2000.  The hardware includes Asus
> P2B motherboard (440BX chipset, single PII/400) and NetGear FA310TX
> network card w/ Lite-On chipset.  We have another 31 identically
> configured systems which do *not* have the above problem, so the cause
> is probably some intricate hardware interaction.  Perhaps changing the
> network card would help -- but we are tired of experimenting with
> network cards, so I now use my 'heartbeat' script to reset the tulip
> driver.  It would be much nicer if the tulip driver could detect the
> above problem and recover automatically  -- or better yet, avoid the
> problem entirely.
> 
> Sincerely,
> Josip
> 
> P.S.  'heartbeat' concludes that the interface is dead when pinging two
> different servers a minute apart fails in both cases.  Recovery does
> ifdown/rmmod/modprobe/ifup to reload the tulip driver, then pings a
> server again.  If the server responds, we're back in business; otherwise
> the script pauses for 10 minutes then tries again.  Most of our nodes do
> not have a problem, but this one has a problem about every 5 days.
> 
> -- 
> Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
> ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
> NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
> Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134

This bug has been in the tulip driver for a long time. I described it for
the tulip driver in 2.2.14 and provided a fix. I have not ported that fix to
tulip 0.92.

The reason for the lockup is that once the driver has given away all its rx
buffers and can't allocate new ones, there will be no more rx interrupts.
Since rx buffers are only refilled while handling an rx interrupt, the driver
will never allocate new rx buffers again.
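
The stall can be reproduced in a toy model. The sketch below is not the tulip
driver's code: struct rx_ring, rx_refill() and rx_packet() are made-up names,
and the allocator is just a counter. It only assumes what is described above,
namely that refilling happens solely on the rx interrupt path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RX_RING_SIZE 4

/* Toy model of an rx ring; not the driver's real data structures. */
struct rx_ring {
	bool chip_owns[RX_RING_SIZE];	/* slot holds a buffer the chip may fill */
	int free_bufs;			/* buffers the simulated allocator can hand out */
};

/* Put a fresh buffer into every empty slot; returns the number refilled. */
int rx_refill(struct rx_ring *r)
{
	int refilled = 0;

	assert(r->free_bufs >= 0);
	for (size_t i = 0; i < RX_RING_SIZE; i++) {
		if (!r->chip_owns[i] && r->free_bufs > 0) {
			r->free_bufs--;
			r->chip_owns[i] = true;
			refilled++;
		}
	}
	return refilled;
}

/*
 * One packet arrives on the wire. Returns true if the chip had a buffer
 * for it, which is the only case in which an rx interrupt is raised.
 */
bool rx_packet(struct rx_ring *r)
{
	for (size_t i = 0; i < RX_RING_SIZE; i++) {
		if (r->chip_owns[i]) {
			r->chip_owns[i] = false;	/* buffer handed up the stack */
			rx_refill(r);			/* refill runs only here */
			return true;
		}
	}
	return false;	/* rx suspended: no interrupt, so no refill either */
}
```

Starting with a full ring and free_bufs == 0, four calls to rx_packet()
succeed and every later call fails. It keeps failing even after free_bufs is
raised again, until something outside the interrupt path calls rx_refill().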

A simple workaround is to raise rx_copybreak above the largest size an
ethernet packet can have, e.g.

	rx_copybreak = 2000;

This may cost you a little performance.
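
Why this helps, as a simplified sketch rather than the driver's actual rx
code: packets shorter than rx_copybreak are copied into a freshly allocated
buffer and the ring buffer stays in the ring, while longer packets are handed
up in the ring buffer itself, which then needs a replacement. The names
rx_choose_path, RX_COPIED and RX_HANDED_UP are invented for this example, and
1514 is assumed as the largest untagged ethernet frame without FCS.

```c
#include <stddef.h>

#define MAX_ETH_FRAME 1514	/* 14-byte header + 1500-byte payload (assumed) */

enum rx_path {
	RX_COPIED,	/* packet copied out; ring buffer is reused as is */
	RX_HANDED_UP	/* ring buffer leaves the ring; slot must be refilled */
};

/* Decide how a received packet of pkt_len bytes leaves the rx ring. */
enum rx_path rx_choose_path(size_t pkt_len, size_t copybreak)
{
	return pkt_len < copybreak ? RX_COPIED : RX_HANDED_UP;
}
```

With a copybreak of 2000 even a MAX_ETH_FRAME sized packet takes the
RX_COPIED path, so in this model the ring never gives a buffer away. The
price is one extra copy per packet, which is where the performance decrease
comes from.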

I probably will not port my fix to 0.92 as I don't use it. I sent the patch
to this list and to Donald Becker. As I got no response from him - he is
probably too busy - I have no idea whether this bug will be fixed in a later
version.

By the way, there is another bug which may cause tx to stop (I fixed that for
the driver in 2.2.15, too). So if you port the patch, you may want to port
that fix as well.

Greetings

Wolfgang Walter