[realtek] Fix for tx-timeout on rtl8139

Edgar Toernig froese@gmx.de
Mon Feb 25 23:32:00 2002


I'm using the rtl8139 (1.16a) driver on a 2.0.32 kernel and
got this kind of tx timeout under heavy load (watching TV on
an X-terminal):

eth0: Transmit timeout, status 0d 0000 media 00.
eth0: Tx queue start entry 66378  dirty entry 66378.
eth0:  Tx descriptor 0 is 0008a5ea.
eth0:  Tx descriptor 1 is 0008a5ea.
eth0:  Tx descriptor 2 is 0008a5ea. (queue head)
eth0:  Tx descriptor 3 is 0008a5ea.
eth0: MII #32 registers are: 1100 782d 0000 0000 01e1 41e1 0001 0000.

As can be seen, the tx queue is empty and all packets have been
sent successfully.  Some debugging showed that all of these errors
were generated by the check at the beginning of rtl8129_start_xmit(),
and whenever it fired dev->start was always 0!  There's only one
place that sets it to 0 (besides close), and that's the
netif_stop_tx_queue a little bit further down:

    outl(tp->tx_flag | (skb->len >= ETH_ZLEN ? skb->len : ETH_ZLEN),
         ioaddr + TxStatus0 + entry*4);

    /* There is a race condition here -- we might read dirty_tx, take an
       interrupt that clears the Tx queue, and only then set tx_full.
       So we do this in two phases. */
    if (++tp->cur_tx - tp->dirty_tx >= NUM_TX_DESC) {
        set_bit(0, &tp->tx_full);
        if (tp->cur_tx - (volatile unsigned int)tp->dirty_tx < NUM_TX_DESC) {
            clear_bit(0, &tp->tx_full);
        } else
            netif_stop_tx_queue(dev);
    } else
        ...

For one, it seems strange to stop the queue here at all.  For another,
there's a race condition regarding dev->tbusy: if an irq happens between
the second dirty_tx check and the netif_stop_tx_queue, the irq unpauses
the queue and then it is paused again here.  I simply removed the
netif_stop_tx_queue call (the queue is already paused at the beginning)
and now the card seems to work without problems.
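
The interleaving described above can be sketched as a small single-threaded
simulation.  This is not driver code: struct sim_dev, sim_tx_irq() and
sim_xmit_tail() are hypothetical stand-ins, the "interrupt" is just a
function call placed in the race window, and the unpause in the other
branches is an assumption about what the driver does there.

```c
#include <stdbool.h>

#define NUM_TX_DESC 4

/* Hypothetical stand-in for the driver state and dev->tbusy. */
struct sim_dev {
    unsigned int cur_tx;    /* next descriptor to fill */
    unsigned int dirty_tx;  /* oldest descriptor not yet reclaimed */
    bool tx_full;
    bool tbusy;             /* true = queue stopped */
};

/* Models the Tx-done interrupt: reclaim everything and wake the queue. */
static void sim_tx_irq(struct sim_dev *d)
{
    d->dirty_tx = d->cur_tx;
    d->tx_full = false;
    d->tbusy = false;
}

/* Models the tail of the xmit routine.  With irq_in_window set, the
   interrupt fires after the second dirty_tx check but before the queue
   is stopped.  With stop_queue cleared, the stop call is left out. */
static void sim_xmit_tail(struct sim_dev *d, bool irq_in_window, bool stop_queue)
{
    d->tbusy = true;                    /* queue is paused on entry */
    if (++d->cur_tx - d->dirty_tx >= NUM_TX_DESC) {
        d->tx_full = true;
        if (d->cur_tx - d->dirty_tx < NUM_TX_DESC) {
            d->tx_full = false;
            d->tbusy = false;           /* assumed unpause */
        } else {
            if (irq_in_window)
                sim_tx_irq(d);          /* ring drained, queue woken */
            if (stop_queue)
                d->tbusy = true;        /* ...and stopped again */
        }
    } else {
        d->tbusy = false;               /* assumed unpause */
    }
}

/* True if the queue ends up stopped although the ring is empty. */
static bool sim_race_leaves_queue_stuck(bool stop_queue)
{
    struct sim_dev d = { NUM_TX_DESC - 1, 0, false, false };

    sim_xmit_tail(&d, true, stop_queue);
    return d.tbusy && d.cur_tx == d.dirty_tx;
}
```

In this model, sim_race_leaves_queue_stuck(true) reports a stopped queue
with an empty ring, which matches the state in the timeout log, while
sim_race_leaves_queue_stuck(false) (stop call removed) does not.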

But I think there's another race in the above code snippet.  I'm
not very familiar with the networking details of the kernel, nor have
I read the rtl8139 manual, but I guess that the outl(..., TxStatus0)
starts the tx on the chip.  Interrupts seem to be enabled when this
code runs.  So what happens when an irq occurs between this outl
and the ++tp->cur_tx, and somehow the irq takes so long that the
packet is transmitted while the irq handler is still running?  It
will ack the interrupt for this packet without actually processing
it.  Now the packet hangs in the buffer until another packet is sent.
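
A sketch of this second scenario, again as a plain simulation under stated
assumptions: struct sim_nic and all sim_* functions are hypothetical, and
the handler shape (ack first, then reclaim only up to cur_tx) is a guess
rather than something taken from the driver source or the chip manual.

```c
#include <stdbool.h>

/* Hypothetical model; field names only loosely follow the driver. */
struct sim_nic {
    unsigned int cur_tx;     /* packets the driver has accounted for */
    unsigned int dirty_tx;   /* packets the handler has reclaimed */
    unsigned int hw_done;    /* packets the chip has finished sending */
    bool irq_pending;        /* Tx-done status bit, cleared by the ack */
};

/* The chip finishes the packet started by the last outl. */
static void sim_chip_completes(struct sim_nic *n)
{
    n->hw_done++;
    n->irq_pending = true;
}

/* Assumed handler shape: ack first, then reclaim only up to cur_tx. */
static void sim_irq_handler(struct sim_nic *n)
{
    n->irq_pending = false;
    while (n->dirty_tx != n->cur_tx && n->dirty_tx != n->hw_done)
        n->dirty_tx++;
}

/* Returns how many sent packets are left unreclaimed with no
   interrupt pending to pick them up. */
static unsigned int sim_stranded_packets(bool irq_in_window)
{
    struct sim_nic n = { 0, 0, 0, false };

    /* outl(..., TxStatus0): the chip starts transmitting here. */
    if (irq_in_window) {
        /* A handler is already running when the chip finishes. */
        sim_chip_completes(&n);
        sim_irq_handler(&n);   /* cur_tx == dirty_tx: nothing reclaimed */
        n.cur_tx++;            /* too late, the interrupt is acked */
    } else {
        n.cur_tx++;
        sim_chip_completes(&n);
        sim_irq_handler(&n);   /* reclaims the packet */
    }
    return n.irq_pending ? 0 : n.cur_tx - n.dirty_tx;
}
```

Here sim_stranded_packets(true) leaves one packet unreclaimed until some
later interrupt arrives, while sim_stranded_packets(false) leaves none.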

Ciao, ET.

PS: I'm not on the list so please CC me.

PPS: Thanks _very_ much that the driver is still maintained for 2.0 kernels!