[eepro100-bug] i82559er problem

Donald Becker becker@scyld.com
Fri Aug 9 08:28:34 2002

On Fri, 9 Aug 2002, Pavan Sikka wrote:

> We have been using computer boards manufactured by EEPD (www.eepd.com)
> that have worked very well for us over the last few years. Recently,
> they changed the on-board network chip from the stock i82559 (PCI dev
> id 0x1229) to the 82559ER (PCI dev id 0x1209). With this chip, as soon
> as the network traffic increased, the network crashed.

Yup, some (but not all?) 559ER chips seem to have a problem when running
short of Rx buffers.  That is supposed to trigger flow control, but
instead the command unit hangs.

> Problem (using version 1.24 of the driver off www.scyld.com):
> To demonstrate the problem:
> 1. On the "problem" machine:
>      ping -f -s 1024 <remote-machine>
> 2. On the remote machine:
>     ping -f -s 1024 <problem-machine>

This _will_ run short of Rx buffers.  The Linux buffer allocator touches
many pages and thus slows down the system.

> We change some of the parameters as follows:
> Tx/Rx DMA burst long to 127
> Tx/Rx Ring size from 32 to 128

If you have plenty of memory, change the Rx ring size to 1024.
Do not change the Tx ring size.
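
To put "plenty of memory" into rough numbers, here is a back-of-the-envelope
sketch.  It assumes one receive buffer of 1536 bytes per ring entry;
RX_BUF_BYTES and rx_ring_bytes() are illustrative names, the buffer size is
an assumption rather than a figure from this thread, and per-descriptor
overhead is ignored.

```c
#include <stddef.h>

/* Assumed size of one Rx buffer in bytes (hypothetical, not from the thread). */
#define RX_BUF_BYTES 1536

/* Approximate memory pinned by an Rx ring with the given number of entries. */
static size_t rx_ring_bytes(size_t ring_entries)
{
    return ring_entries * (size_t)RX_BUF_BYTES;
}
```

Under that assumption a 32-entry ring pins about 48 KB of buffers, while a
1024-entry ring pins about 1.5 MB.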

> Tx Queue size from 12 to 100
> Tx Queue unfull from 8 to 100

This is evil.  First, getting rid of the full/available hysteresis is a
significant performance hit.  And what are you doing with 100 packets
queued up?
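
The full/available hysteresis can be sketched roughly as follows.  This is
not the driver's code: the struct and function names are hypothetical, and
only the two thresholds (12 and 8) come from the defaults quoted above.  The
queue stops accepting packets at the upper threshold and is only reopened
once it has drained below the lower one, so the upper layer is not woken for
every single freed slot.

```c
#include <stdbool.h>

/* Thresholds taken from the defaults quoted in the message. */
#define TX_QUEUE_LIMIT  12   /* stop accepting packets at this depth */
#define TX_QUEUE_UNFULL  8   /* reopen only once depth drops below this */

/* Hypothetical, simplified view of the transmit ring state. */
struct tx_state {
    unsigned int cur_tx;    /* count of packets queued so far */
    unsigned int dirty_tx;  /* count of packets already completed */
    bool stopped;           /* true while the upper layer is held off */
};

static unsigned int tx_depth(const struct tx_state *s)
{
    return s->cur_tx - s->dirty_tx;
}

/* Queue one packet; returns false if the queue is currently stopped. */
static bool tx_enqueue(struct tx_state *s)
{
    if (s->stopped)
        return false;
    s->cur_tx++;
    if (tx_depth(s) >= TX_QUEUE_LIMIT)
        s->stopped = true;
    return true;
}

/* Retire up to 'done' packets; returns true if this call reopened the queue. */
static bool tx_complete(struct tx_state *s, unsigned int done)
{
    if (done > tx_depth(s))
        done = tx_depth(s);
    s->dirty_tx += done;
    if (s->stopped && tx_depth(s) < TX_QUEUE_UNFULL) {
        s->stopped = false;
        return true;
    }
    return false;
}
```

Setting both thresholds to the same value, as in the quoted change, removes
the gap: the queue would be reopened after a single completion and could
fill again immediately.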

> Tx Timeout from 2*Hz to 4*Hz

I'm gradually changing all of the drivers to have timeouts greater than 3
seconds to handle a worst-case link renegotiation.
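
A minimal sketch of a tick-based transmit timeout check, assuming a tick
rate of 100 per second.  HZ is defined locally here for illustration and
tx_timed_out() is a hypothetical helper, not a function from the driver.

```c
/* Assumed tick rate; a real kernel supplies its own HZ. */
#define HZ 100

/* 4 seconds, comfortably above a 3 second worst-case renegotiation. */
#define TX_TIMEOUT (4 * HZ)

/*
 * Returns nonzero if more than TX_TIMEOUT ticks have elapsed since the last
 * transmit started.  Unsigned subtraction keeps the comparison correct when
 * the tick counter wraps around.
 */
static int tx_timed_out(unsigned long now, unsigned long trans_start)
{
    return (now - trans_start) > (unsigned long)TX_TIMEOUT;
}
```

With the original 2*HZ value the timeout would fire after 2 seconds, before
a 3 second renegotiation could finish.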

> With these new parameters, the network survives for a few minutes.

What is the failure mode?

> 1. Are there any known issues with the i82559ER chip ?

Yes, but Intel won't tell us.  I've been through the rounds of "Are
there any bugs" "no" "What about this one" "That's the only one".

> 2. Given the info below, do you think the eeprom contents are sane ?

Yes.  This bug looks like the Sleep-mode bug, but it's obviously a new one.

> 3. I have tried very hard to get the software manual for this chip but
> have not yet succeeded. I will be happy to sign an NDA but I cant seem
> to find anyone in Intel to talk to (I am in Australia). Could you
> provide a contact in Intel who could facilitate this ?

_I_ can't get updated programming information.  Some parts of Intel
can't figure out who their supporters are...

> Some interesting bits from /var/log/messages when the network crashes:
> Aug  9 16:32:49 load2 kernel: Command 80 was not immediately accepted, 10001 ticks!

No command should take 10000 PCI bus cycles.  The internal firmware for
the chip's Command Unit (which processes the transmit list) seems to have
crashed at this point.
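
The kind of bounded wait that produces a message like the one above can be
sketched as follows.  This is an illustration under assumptions, not the
driver's routine: the register read is abstracted behind a hypothetical
callback so the example is self-contained, the 10000 cap is inferred from
the "10001 ticks" in the log, and fake_reg is a stand-in for hardware.

```c
#include <stdint.h>

/* Assumed poll cap, inferred from the "10001 ticks" in the log line. */
#define CMD_WAIT_LIMIT 10000

/* Hypothetical accessor standing in for a read of the command register. */
typedef uint8_t (*read_cmd_fn)(void *ctx);

/*
 * Poll until the command register reads zero (previous command accepted).
 * Returns the number of busy polls seen, or -1 if the cap was exhausted,
 * which is the case a caller would log as a hung command unit.
 */
static int wait_for_cmd_accept(read_cmd_fn read_cmd, void *ctx)
{
    for (int polls = 0; polls <= CMD_WAIT_LIMIT; polls++) {
        if (read_cmd(ctx) == 0)
            return polls;
    }
    return -1;
}

/* Fake register for illustration: reads as busy for a set number of polls. */
struct fake_reg {
    int busy_reads;   /* remaining reads that return the pending command */
    uint8_t pending;  /* command byte still sitting in the register */
};

static uint8_t fake_read(void *ctx)
{
    struct fake_reg *r = ctx;
    if (r->busy_reads > 0) {
        r->busy_reads--;
        return r->pending;
    }
    return 0;
}
```

A healthy chip clears the register within a handful of polls.  A register
that never clears runs the loop to its cap, which matches the hang described
above.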

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993