Network locking up with SMC EPIC/100 Ethernet cards

Terry Barnaby terry@beam.demon.co.uk
Wed Mar 3 07:28:58 1999


In-depth driver testing showed that the 83C171 stops processing the transmit
ring queue. When this happens there are two entries in the transmit queue,
with ring status fields of 0x6003 and 0x8000. The chip restarts when the
TXQUEUED bit is set high. Quite often the fault does not occur for some
time (around 2 minutes) and then recurs repeatedly, about once a second, for
a period.

I believe we have tracked down this problem. It appears to be a race between
writing a transmit ring entry's status field and setting the TXQUEUED bit.
In certain situations, for example when two nodes are ping-ponging network
packets on a fast machine, a packet can be placed into the ring and the
TXQUEUED bit set just as the EPIC chip reaches the end of the ring and
clears the TXQUEUED bit. This can leave two packets in the ring awaiting
transmission. The stall is only cleared when some other packet needs
transmitting and sets the TXQUEUED bit again.

This problem may occur in other network drivers including the Tulip driver.

I am not sure of the best way to cure this. One way that appears to work is
to check, when the TxEmpty interrupt occurs, whether there is more than one
packet in the ring, and if so to set the TXQUEUED bit. Another way is to add
a delay between writing the ring entry's status and setting the TXQUEUED bit.
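
The first workaround can be sketched as a toy model in C. This is not the
driver code: struct tx_model and all of the function names are hypothetical,
and the chip is reduced to a single boolean. It assumes that at TxEmpty time
the just-completed entry has not yet been reclaimed, which is why the test is
"more than one" rather than "any".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the transmit path; every name here is hypothetical. */
struct tx_model {
    unsigned cur_tx;    /* count of packets the driver has queued */
    unsigned dirty_tx;  /* count of packets the driver has reclaimed */
    bool txqueued;      /* stands in for the chip's TXQUEUED bit */
};

/* Ring entries written by the driver and not yet reclaimed. */
static unsigned tx_unreclaimed(const struct tx_model *m)
{
    return m->cur_tx - m->dirty_tx;
}

/* Driver side: write the ring entry's status, then set TXQUEUED. */
static void driver_queue_packet(struct tx_model *m)
{
    m->cur_tx++;
    m->txqueued = true;
}

/* Chip side: it believes the ring is drained and clears TXQUEUED. */
static void chip_clear_txqueued(struct tx_model *m)
{
    m->txqueued = false;
}

/* Driver side: reclaim one completed ring entry. */
static void driver_reclaim_one(struct tx_model *m)
{
    assert(tx_unreclaimed(m) > 0);
    m->dirty_tx++;
}

/* Workaround, run on TxEmpty before reclaiming: if more than one entry
 * is still outstanding, one of them was stranded by the race, so set
 * TXQUEUED again. Returns true if the chip was re-kicked. */
static bool on_tx_empty(struct tx_model *m)
{
    if (tx_unreclaimed(m) > 1) {
        m->txqueued = true;
        return true;
    }
    return false;
}
```

In the normal case one entry is outstanding at TxEmpty and nothing happens.
In the race case a second entry was queued just before the chip cleared
TXQUEUED, so on_tx_empty() sees two entries and sets the bit again.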

In looking at the driver there appears to be another possible problem:

	1. In the epic_interrupt routine the variable dirty_tx should be
		an unsigned int. After 2G packets the driver may fall over
		(10 hours at peak rate). Actually, I believe the driver will
		fall over after 4G packets anyway? Shouldn't the
		cur_tx/dirty_tx and corresponding RX pointers be reset when
		they get large? This appears to be in the Tulip driver as
		well.
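
On the 4G question, a small C sketch of how unsigned counters behave at
wraparound may help. It assumes two things that I have not confirmed for
the driver: that both counters are unsigned, and that the ring size divides
the counter range evenly (any power of two does). TX_RING_SIZE and the
function names below are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define TX_RING_SIZE 16u  /* hypothetical ring size, a power of two */

/* Number of entries queued but not yet reclaimed. Unsigned subtraction
 * is defined modulo UINT_MAX + 1, so the result stays correct even
 * after cur_tx has wrapped past zero and dirty_tx has not. */
static unsigned tx_in_flight(unsigned cur_tx, unsigned dirty_tx)
{
    unsigned n = cur_tx - dirty_tx;
    assert(n <= TX_RING_SIZE);  /* invariant: never more than a full ring */
    return n;
}

/* Ring slot for a free-running counter. Because TX_RING_SIZE divides
 * UINT_MAX + 1, the slot sequence continues without a jump at the wrap. */
static unsigned ring_slot(unsigned counter)
{
    return counter % TX_RING_SIZE;
}
```

Under those assumptions the counters would not need resetting: with
dirty_tx at UINT_MAX - 1 and cur_tx wrapped round to 2, tx_in_flight()
still gives 4. With a signed int the same overflow is undefined behaviour
in C, which is the case for making dirty_tx unsigned.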
		

THE PROBLEM
==============================================
We have a problem with networking under Linux. We believe it is associated with
the EPIC/100 Ethernet card driver, but it could be the Linux networking system.
The problem is that the Ethernet interface seems to lock up occasionally. It
appears that the system will accept no more packets until a packet is
transmitted. It appears to be timing or packet size related.

We have noticed this problem for around 6 months now with other motherboards,
processors, hubs and switches, kernel versions and EPIC driver versions,
but have finally got around to trying to track down the problem.

We have a simple test program that demonstrates the problem.
The test application ping-pongs a 2568 byte lump of data between two
processes running on two nodes, using the write() and read() system calls. The
TCP/IP socket has been set to TCP_NODELAY. When not locked up, the test program
reports a bi-directional data rate of 5.708708 MBytes per second. A buffer
transmit size of 2568 bytes seems to trigger the lock-up more frequently.
With this saturated test the system will lock up about once per 10 seconds.
Pinging from the remote machine unlocks the network.
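
For anyone wanting to reproduce this, the core of such a test can be
sketched as below. This is an illustration rather than our actual test
program: PING_SIZE and the helper names are made up here, and connection
setup, timing and the peer's echo loop are left out.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define PING_SIZE 2568  /* payload size quoted in the text */

/* Disable Nagle's algorithm on a connected TCP socket. */
static int set_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}

/* write() may accept fewer bytes than asked; loop until all are sent. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* read() may return a partial buffer; loop until len bytes arrive. */
static int read_all(int fd, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;  /* peer closed the connection */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* One round trip from the initiating side: send the lump, then wait
 * for the peer to send it back. Returns 0 on success, -1 on error. */
static int ping_pong_once(int fd, char *buf)
{
    assert(buf != NULL);
    if (write_all(fd, buf, PING_SIZE) != 0)
        return -1;
    return read_all(fd, buf, PING_SIZE);
}
```

The initiating node would call set_nodelay() once after connect() and then
ping_pong_once() in a loop. When the interface locks up, the read_all()
inside it simply stops returning until some other packet is transmitted.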

The current systems are:

	Motherboard:	Supermicro P6DBU
	Processors:	Pentium II 450 MHz
	Ethernet:	SMC Etherpower II (Chip: 83C171A2QF P, A29840, BH6482.1)
	Linux:		RedHat 5.2
	Kernel:		2.0.36
	Driver:		EPIC/100 v1.03, 1.06, 1.07
	Hub:		Allied Telesyn MR904TX
	
Setting the EPIC100.c driver's debug level to 4 does not produce any messages.
There are no errors reported in /proc/net/dev.

The /proc/net/tcp file has the following entries when locked up:
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0200A8C0:13D8 0100A8C0:041B 01 00000454:00000000 01:0000037C 00000006  1002        0 1663
   1: 0200A8C0:13D8 00000000:0000 0A 00000000:00000000 00:FFFF405E 00000000  1002        0 1660

When running normally it has the following:
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0200A8C0:13D8 0100A8C0:041B 01 00000A08:00000000 01:00000014 00000000  1002        0 1663
   1: 0200A8C0:13D8 00000000:0000 0A 00000000:00000000 00:FFFEFC69 00000000  1002        0 1660
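
The fields above are hexadecimal. A short C sketch for decoding them
follows; the function names are made up for illustration, and the byte
order assumes the file was written by a little-endian kernel, as on these
Pentium II machines.

```c
#include <assert.h>
#include <stdio.h>

/* Decode one "ADDR:PORT" field as printed by a little-endian kernel:
 * the lowest byte of the hex word is the first octet of the address.
 * Returns 0 on success, -1 if the field does not parse. */
static int decode_endpoint(const char *field, unsigned octets[4],
                           unsigned *port)
{
    unsigned addr;

    assert(field != NULL && octets != NULL && port != NULL);
    if (sscanf(field, "%8X:%4X", &addr, port) != 2)
        return -1;
    octets[0] = addr & 0xFF;
    octets[1] = (addr >> 8) & 0xFF;
    octets[2] = (addr >> 16) & 0xFF;
    octets[3] = (addr >> 24) & 0xFF;
    return 0;
}

/* Decode the "tx_queue:rx_queue" field into byte counts. */
static int decode_queues(const char *field, unsigned *tx, unsigned *rx)
{
    assert(field != NULL && tx != NULL && rx != NULL);
    return sscanf(field, "%8X:%8X", tx, rx) == 2 ? 0 : -1;
}
```

Decoded this way, 0200A8C0:13D8 is 192.168.0.2 port 5080 and 0100A8C0:041B
is 192.168.0.1 port 1051. The tx_queue is 0x454 (1108 bytes) when locked up
and 0xA08 (2568 bytes, one full test buffer) when running, and the retrnsmt
count is 6 when locked up against 0 when running.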

We have swapped the SMC EPIC/100 board for an older SMC Tulip based card
and all works fine. So this points to the EPIC/100 driver.

It appears that an older version of the SMC chip, the 83C170, may not have
this problem, or it may behave differently.


-- 
  Dr Terry Barnaby                     BEAM Ltd
  Phone: +44 1454 324512               Northavon Business Center, Dean Rd
  Fax:   +44 1454 313172               Yate, Bristol, BS17 5NH, UK
  Email: terry@beam.demon.co.uk        Web: www.beam.demon.co.uk
  BEAM for: Visually Impaired X-Terminals, Parallel Processing, Software Dev
                         "Tandems are twice the fun !"

