[tulip-bug] tulip.c:v0.92 Rx suspended problem

Josip Loncaric josip@icase.edu
Mon, 21 Aug 2000 10:26:24 -0400


On a few of our (supposedly identical) systems we tend to lose network
connectivity.  I wrote a 'heartbeat' script that detects this, runs
'tulip-diag', and restarts the interface (including removing and
reloading the tulip driver).  Here is a fragment from the script log:

Sat Aug 19 07:15:17 EDT 2000 : heartbeat : n015 lost connectivity to fs1
tulip-diag.c:v2.00 4/19/2000 Donald Becker (becker@scyld.com)
 http://www.scyld.com/diag/index.html
Index #1: Found a Lite-On 82c168 PNIC adapter at 0xd000.
 Port selection is MII, full-duplex.
 Transmit started, Receive started, full-duplex.
  The Rx process state is 'Suspended -- no Rx buffers'.
  The Tx process state is 'Idle'.
  The transmit threshold is 128.
 Use '-a' or '-aa' to show device registers,
     '-e' to show EEPROM contents, -ee for parsed contents,
  or '-m' or '-mm' to show MII management registers.
Sat Aug 19 07:16:26 EDT 2000 : heartbeat : n015 connectivity restored to fs1

The above may be related to the condition reported in /var/log/messages
a couple of minutes earlier:

Aug 19 07:13:22 n015 kernel: eth0: Restarted Rx at 2632898 / 2632898.

FYI, the above was observed on a system running Red Hat 6.2 (kernel
2.2.16-3) with the driver updated to tulip.c:v0.92 4/17/2000.  The
hardware is an Asus P2B motherboard (440BX chipset, single PII/400) and
a NetGear FA310TX network card with a Lite-On chipset.  We have another
31 identically
configured systems which do *not* have the above problem, so the cause
is probably some intricate hardware interaction.  Perhaps changing the
network card would help -- but we are tired of experimenting with
network cards, so I now use my 'heartbeat' script to reset the tulip
driver.  It would be much nicer if the tulip driver could detect the
above problem and recover automatically -- or better yet, avoid the
problem entirely.

Sincerely,
Josip

P.S.  'heartbeat' concludes that the interface is dead when pings to two
different servers, sent a minute apart, both fail.  Recovery runs
ifdown/rmmod/modprobe/ifup to reload the tulip driver, then pings a
server again.  If the server responds, we're back in business; otherwise
the script pauses for 10 minutes then tries again.  Most of our nodes do
not have a problem, but this one has a problem about every 5 days.
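
The heartbeat logic described in this P.S. could be sketched roughly as
follows.  This is not the actual script, only a minimal Python
illustration under several assumptions: the function names, the
interface name eth0 (taken from the kernel log line above), the ping
flags (-c and -W as in Linux iputils), and the reading of "tries again"
as "repeat the driver reload" are all guesses.  The 'tulip-diag' step
is left out.

```python
import subprocess
import time


def ping(host, count=1, timeout_s=5):
    """Return True if host answers a ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def interface_dead(servers, ping_fn=ping, sleep_fn=time.sleep, gap_s=60):
    """Declare the interface dead only if every server fails to answer.

    Servers are pinged one after another, gap_s seconds apart; a single
    successful reply means the interface is considered alive.
    """
    for i, host in enumerate(servers):
        if i > 0:
            sleep_fn(gap_s)
        if ping_fn(host):
            return False
    return True


# Assumed command sequence for reloading the driver on eth0.
RELOAD_STEPS = [
    ["ifdown", "eth0"],
    ["rmmod", "tulip"],
    ["modprobe", "tulip"],
    ["ifup", "eth0"],
]


def recover(server, run_fn, ping_fn=ping, sleep_fn=time.sleep,
            retry_s=600, max_tries=None):
    """Reload the driver, then ping server; on failure wait and repeat.

    run_fn is called with each command (a list of strings).  Returns
    True once the server answers, or False after max_tries reloads.
    """
    tries = 0
    while max_tries is None or tries < max_tries:
        for cmd in RELOAD_STEPS:
            run_fn(cmd)
        tries += 1
        if ping_fn(server):
            return True
        sleep_fn(retry_s)
    return False
```

On a real node, run_fn could be something like
lambda cmd: subprocess.run(cmd, check=False), run as root.  Passing the
ping, sleep, and run functions as parameters keeps the decision logic
testable without touching the network or the driver.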

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134