[realtek-bug] Another 1.10 bug (more important - with fix)

Donald Becker becker@scyld.com
Fri, 14 Jul 2000 21:24:23 -0400 (EDT)


On Fri, 14 Jul 2000, Paul Campbell wrote:

> SYMPTOMS
> 
> 	periodically under reasonable NFS load  a random machine
> 	in my farm would stop talking - you could ping it but the
...
> 		eth0: Transmit error, Tx status 400820aa.   
> 	(a transmit abort). I found you could unwedge a stuck
> 	machine by pinging it with longish packets (say 10k bytes)
> 	at which time it printed:

This makes sense -- NFS traffic creates a very busy network, with no TCP-style
back-off as delay increases.  This increases your chance of getting 16
collisions on a single packet.
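
For reference, the 16-collision limit comes from Ethernet's truncated binary
exponential backoff.  The following is only a sketch of that rule, assuming the
usual limits of 16 attempts and a backoff exponent capped at 10.  The function
and macro names are made up for illustration and are not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define ATTEMPT_LIMIT 16  /* assumed: frame is aborted after this many attempts */
#define BACKOFF_LIMIT 10  /* assumed: backoff exponent stops growing here */

/* Largest backoff, in slot times, that may be chosen after the given
 * number of collisions on one frame (counted from 1). */
uint32_t max_backoff_slots(unsigned collisions)
{
    assert(collisions >= 1);
    unsigned k = collisions < BACKOFF_LIMIT ? collisions : BACKOFF_LIMIT;
    return (1u << k) - 1;
}

/* Non-zero once the station has to give up on the frame. */
int tx_aborted(unsigned collisions)
{
    return collisions >= ATTEMPT_LIMIT;
}
```

The delay range doubles with each collision up to the tenth and then stays
flat, so a frame that still collides on its sixteenth attempt is reported as a
transmit abort.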

> SOLUTION
> 
> 	The problem is in the transmit interrupt service routine's response to the
> 	transmit abort state, in the 1.10 driver line 1066 it does:
>- 		outl((TX_DMA_BURST<<8)|0x03000001, ioaddr + TxConfig);    
>+ 		outl((TX_DMA_BURST<<8), ioaddr + TxConfig);    
> 
> 	Note the missing constant - the '1' in it - I believe, according to
> 	the chip's docs, this causes the aborted packet to be retransmitted

The behavior differs among the chip versions.  If I had known it was going to
change, I would have written the driver to just discard the packet.

Always discarding a 16-collision packet is arguably the correct behavior for
flow control anyway.
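
A discard-on-abort policy could look roughly like the sketch below.  This is a
self-contained model, not the driver's code: the ring size, the struct and the
function name are all hypothetical.  The point is only that the slot is
reclaimed the same way whether the frame was sent or aborted, and an aborted
frame is counted rather than queued again.

```c
#define NUM_TX_DESC 4   /* assumed ring size for this sketch */

enum tx_result { TX_OK, TX_ABORTED };

struct tx_ring {
    unsigned dirty_tx;           /* oldest slot not yet reclaimed */
    unsigned tx_packets;         /* frames sent successfully */
    unsigned tx_aborted_errors;  /* frames dropped after an abort */
};

/* Reclaim the oldest slot whatever the outcome.  An aborted frame is
 * counted and dropped; any retry is left to the protocol above.
 * Returns the index of the slot that became free. */
unsigned tx_complete(struct tx_ring *r, enum tx_result res)
{
    unsigned slot = r->dirty_tx % NUM_TX_DESC;

    if (res == TX_ABORTED)
        r->tx_aborted_errors++;
    else
        r->tx_packets++;
    r->dirty_tx++;
    return slot;
}
```

With this policy the chip must not also be told to resend the aborted frame,
otherwise the driver's bookkeeping and the hardware disagree about who owns
the slot.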

> 	but further down in the ISR the driver assumes that the packet
> 	is done and discards the buffer and allows the xmt entry to be reused
> 	I think that this is the cause of the hang - the tx timeout clears
> 	this and resets this state. Also the '3' value appears to put
> 	the transmitter into a state where it uses an illegal interframe gap
> 	(another possible cause of problems)

This is a point on which the documentation is unclear.  I'm pretty certain
that '0' is the right IFG setting.  But the calculation in the manual shows
'3' as the legal setting.

This can only cause a problem with a network near its maximum physical size,
which is now very rare.  The real effect is to be somewhat more aggressive
about claiming the wire, at the expense of other brands of NICs.  So, if the
driver had recovered correctly, the interface would be less likely to have a
16 collision error in the future :-O.  Of course, all other interfaces would
be more likely to have a 16-collision abort.
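
To make the two constants under discussion easier to see, here is a small
sketch that splits a TxConfig value into the fields mentioned in this thread.
The field positions are read off the constant 0x03000001 and the
TX_DMA_BURST<<8 shift quoted above.  The macro names and the 3-bit width of the
burst field are assumptions, as is the burst value of 4 used in the usage note.

```c
#include <stdint.h>

/* Hypothetical names; positions inferred from the quoted constant. */
#define TXCFG_CLEAR_ABORT  0x00000001u              /* the '1' */
#define TXCFG_DMA_SHIFT    8
#define TXCFG_DMA_MASK     (0x7u << TXCFG_DMA_SHIFT) /* assumed 3-bit field */
#define TXCFG_IFG_SHIFT    24
#define TXCFG_IFG_MASK     (0x3u << TXCFG_IFG_SHIFT) /* the '3' */

struct txcfg_fields {
    unsigned clear_abort;  /* 1 if the retransmit/clear-abort bit is set */
    unsigned dma_burst;    /* DMA burst code */
    unsigned ifg;          /* interframe gap code, 0..3 */
};

/* Split a TxConfig value into the three fields above. */
struct txcfg_fields txcfg_decode(uint32_t v)
{
    struct txcfg_fields f;

    f.clear_abort = (v & TXCFG_CLEAR_ABORT) ? 1u : 0u;
    f.dma_burst   = (v & TXCFG_DMA_MASK) >> TXCFG_DMA_SHIFT;
    f.ifg         = (v & TXCFG_IFG_MASK) >> TXCFG_IFG_SHIFT;
    return f;
}
```

Decoding (4<<8)|0x03000001 gives clear_abort 1 and ifg 3, while the patched
value (4<<8) gives clear_abort 0 and ifg 0, so the proposed change alters both
the retransmit bit and the IFG code at once.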


> NOTES
> 	This fixed my problem - it may also fix the mysterious hang other
> 	people have reported - to get to this point I ported a lot of 
> 	the rtl8139too.c driver into my linux 2.2 driver (spinlocks,
> 	the BSD fixes, the extra 4-byte problem etc etc) - this was the change

Some of the other "fixes" are bogus.  When fixing the problem with including
the CRC in the reported Rx packet length, the Rx ring packet wrap was
initially broken.  (I believe that it's now fixed in the latest 2.4.)

The BSD driver Rx problem shouldn't occur, because we use the documented
RxBufEmpty flag, rather than trying to guess that a new packet has arrived.
Checking the next entry and guessing can potentially save one or two
expensive PCI transactions per Rx packet, but does have the problem they
encountered.
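
The shape of an Rx loop gated on the buffer-empty flag can be sketched as
follows.  The chip is replaced by a small stand-in struct so the example runs
on its own.  The bit value, the struct and the function names are hypothetical,
and a real driver would read the flag from the chip's command register.

```c
#include <stdint.h>

#define RX_BUF_EMPTY 0x01  /* hypothetical bit value for this sketch */

/* Stand-in for the chip: frames waiting plus a count of register reads. */
struct fake_nic {
    unsigned pending;
    unsigned reg_reads;
};

/* Models one register read across the bus. */
static uint8_t read_cmd(struct fake_nic *nic)
{
    nic->reg_reads++;
    return nic->pending == 0 ? RX_BUF_EMPTY : 0;
}

/* Handle frames only while the chip says the buffer is not empty.
 * Returns the number of frames handled. */
unsigned rx_drain(struct fake_nic *nic)
{
    unsigned handled = 0;

    while (!(read_cmd(nic) & RX_BUF_EMPTY)) {
        nic->pending--;   /* a real driver would copy the frame out here */
        handled++;
    }
    return handled;
}
```

The cost is visible in reg_reads: one read per frame plus a final read that
finds the buffer empty.  That read is the transaction the guessing approach
tries to avoid.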


Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Beowulf Clusters / Linux Installations
Annapolis MD 21403