bug in tulip_rx?

Mon Nov 22 23:10:59 1999

Hi,

We have a server under heavy load (2.2.14pre4, 64MB RAM, D-Link 530CT+) running
squid.

The problem is that it sometimes just stops receiving frames. An ifconfig
down followed by a restart of the network cures the problem. Nothing is logged.

The same machine worked fine with 2.0.38 for months.

So I first thought it might be a problem with the tulip driver in 2.2.14pre4
and I replaced it with several older versions, but that didn't help.

I now took a closer look at the drivers and I think I found a problem. Please
correct me if I'm wrong.

The following situation may happen:

	cur_rx == dirty_rx + RX_RING_SIZE

This case happens if (and only if) we cannot allocate any skb for the
receive ring. In this situation we therefore cannot receive any more frames.
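A minimal sketch of that condition, not taken from the driver: the helper names are hypothetical and the ring size of 16 is only an assumed value. It relies on cur_rx and dirty_rx being free-running unsigned counters, so the difference is still right after they wrap.

```c
#include <stdbool.h>

#define RX_RING_SIZE 16	/* assumed value; the driver defines its own */

/* Hypothetical helper, not part of the tulip driver.
 * Number of ring entries that have been processed but not yet handed
 * back to the chip. Unsigned subtraction keeps this correct even when
 * the counters have wrapped around.
 */
static unsigned int rx_slots_pending_refill(unsigned int cur_rx,
					    unsigned int dirty_rx)
{
	return cur_rx - dirty_rx;
}

/* True when every ring entry is waiting for a refill, i.e. the chip
 * owns no receive buffer at all.
 */
static bool rx_ring_starved(unsigned int cur_rx, unsigned int dirty_rx)
{
	return rx_slots_pending_refill(cur_rx, dirty_rx) == RX_RING_SIZE;
}
```

With these helpers, rx_ring_starved(tp->cur_rx, tp->dirty_rx) would be true exactly in the situation described above.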

If a frame arrives, the 21041 raises an RU (receive buffer unavailable)
interrupt and the receive process is suspended. tulip_rx is entered. There
are two possibilities:

1. We can allocate at least one skb. In this case dirty_rx is advanced and the
21041 will switch back to running mode when the next frame arrives.

2. We cannot allocate an skb. In this case the 21041 remains in suspend mode
and (if I read the manual correctly) will not raise any further interrupt
that would call tulip_rx again => hang.
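To make the two cases concrete, here is a toy model of the chip's receive side. It only encodes my reading of the manual as described above (one RU interrupt on the transition to suspend, none afterwards); the struct, the field names and the function are all hypothetical and have nothing to do with the real driver or register layout.

```c
/* Toy model only: all names here are made up for illustration. */
enum rx_state { RX_RUNNING, RX_SUSPENDED };

struct toy_nic {
	enum rx_state state;
	unsigned int owned;	/* descriptors currently owned by the chip */
	unsigned int irqs;	/* interrupts raised so far */
};

/* Model the arrival of one frame. */
static void toy_frame_arrives(struct toy_nic *nic)
{
	if (nic->owned > 0) {
		/* Case 1: a buffer is available, so the chip runs
		 * (or resumes), fills it and raises a receive interrupt.
		 */
		nic->state = RX_RUNNING;
		nic->owned--;
		nic->irqs++;
		return;
	}

	if (nic->state == RX_RUNNING) {
		/* No buffer: suspend and raise the RU interrupt once. */
		nic->state = RX_SUSPENDED;
		nic->irqs++;
	}
	/* Case 2: already suspended and still no buffer. The frame is
	 * lost and, under the assumption above, no interrupt is raised,
	 * so nothing ever calls the refill code again.
	 */
}
```

In this model, once the chip is suspended with owned == 0, any number of further frames leaves irqs unchanged, which is the hang. Handing it a single buffer is enough to get interrupts flowing again.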

A workaround would be either

	to use the timer of the chip and a receive poll command

or the following modification in tulip_rx (not meant as a patch, just as an idea):

	unsigned int odirty_rx = tp->dirty_rx;
	int skbs = rx_work_limit;
	int lskb = -1;
	....

	while(...) {

		if ( (pkt_len < rx_copybreak || skbs == 1) 
			&& (skb = dev_alloc_skb(pkt_len + 2)) != NULL) {
			lskb = entry;
			....
		} else if (skbs != 1) {
			skbs--;
			...
		} else {
			/* drop the packet and keep the skb: it is our last
			 * and we couldn't allocate another one
			 */
			lskb = entry;
		}
		....
	}	

	/* Refill the Rx ring buffers. */
	....

	if (lskb>=0 && odirty_rx == tp->dirty_rx) {
		/* could not allocate a skb for dirty_rx */
		entry = tp->dirty_rx % RX_RING_SIZE;
		tp->rx_skbuff[entry] = tp->rx_skbuff[lskb];
		tp->rx_ring[entry].buffer1 = tp->rx_ring[lskb].buffer1;
		tp->rx_skbuff[lskb] = NULL;
	}

The above scenario cannot happen any more because there will always be a
next buffer for the 21041. So even if it is already in suspend mode, it will
return to running mode when it receives another packet.

What do you think? Could this be the reason for these spurious hangs?

Greetings,

Wolfgang Walter