[eepro100-bug] i82559er problem
Donald Becker
becker@scyld.com
Fri Aug 9 08:28:34 2002
On Fri, 9 Aug 2002, Pavan Sikka wrote:
> We have been using computer boards manufactured by EEPD (www.eepd.com)
> that have worked very well for us over the last few years. Recently,
> they changed the on-board network chip from the stock i82559 (PCI dev
> id 0x1229) to the 82559ER (PCI dev id 0x1209). With this chip, as soon
> as the network traffic increased, the network crashed.
Yup, some (but not all?) 559ER chips seem to have a problem when running
short of Rx buffers. That is supposed to trigger flow control, but
instead the command unit hangs.
> Problem (using version 1.24 of the driver off www.scyld.com):
>
> To demonstrate the problem:
>
> 1. On the "problem" machine:
> ping -f -s 1024 <remote-machine>
>
> 2. On the remote machine:
> ping -f -s 1024 <problem-machine>
This _will_ run short of Rx buffers. The Linux buffer allocator touches
many pages and thus slows down the system.
> We change some of the parameters as follows:
>
> Tx/Rx DMA burst length to 127
> Tx/Rx Ring size from 32 to 128
If you have plenty of memory, change the Rx ring size to 1024.
Do not change the Tx ring size.
> Tx Queue size from 12 to 100
> Tx Queue unfull from 8 to 100
This is evil. First, getting rid of the full/available hysteresis is a
significant performance hit. And what are you doing with 100 packets
queued up?
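
To illustrate the full/available hysteresis, here is a minimal sketch. It
is not the driver's code: the struct and function names are hypothetical,
and only the two thresholds (12 and 8) come from the values quoted above.
The queue stops at 12 outstanding packets and restarts only after draining
below 8.

```c
#include <assert.h>
#include <stdbool.h>

/* Thresholds quoted in the thread; everything else here is illustrative. */
#define TX_QUEUE_LIMIT  12   /* stop accepting packets at this depth */
#define TX_QUEUE_UNFULL  8   /* restart only once depth drops below this */

struct tx_queue {            /* hypothetical, not the driver's struct */
    unsigned int cur_tx;     /* count of packets queued so far */
    unsigned int dirty_tx;   /* count of packets reclaimed so far */
    bool full;               /* true while the queue is stopped */
};

/* Packets queued but not yet reclaimed; unsigned wraparound is harmless. */
static unsigned int tx_pending(const struct tx_queue *q)
{
    return q->cur_tx - q->dirty_tx;
}

/* Queue one packet. Returns false if the queue is stopped. */
static bool tx_enqueue(struct tx_queue *q)
{
    if (q->full)
        return false;
    q->cur_tx++;
    if (tx_pending(q) >= TX_QUEUE_LIMIT)
        q->full = true;
    return true;
}

/* Reclaim 'done' completed packets. Returns true if the queue restarted. */
static bool tx_reclaim(struct tx_queue *q, unsigned int done)
{
    assert(done <= tx_pending(q));
    q->dirty_tx += done;
    if (q->full && tx_pending(q) < TX_QUEUE_UNFULL) {
        q->full = false;
        return true;
    }
    return false;
}
```

With both thresholds set to the same value (100 and 100), the first
reclaimed packet restarts the queue and the next queued packet stops it
again, so the stop/restart path can run once per packet instead of once
per batch.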
> Tx Timeout from 2*Hz to 4*Hz
I'm gradually changing all of the drivers to have timeouts greater than 3
seconds to handle a worst-case link renegotiation.
> With these new parameters, the network survives for a few minutes.
What is the failure mode?
> 1. Are there any known issues with the i82559ER chip ?
Yes, but Intel won't tell us. I've been through the rounds of "Are
there any bugs" "no" "What about this one" "That's the only one".
> 2. Given the info below, do you think the eeprom contents are sane ?
Yes. This bug looks like the Sleep-mode bug, but it's obviously a new one.
> 3. I have tried very hard to get the software manual for this chip but
> have not yet succeeded. I will be happy to sign an NDA but I can't seem
> to find anyone in Intel to talk to (I am in Australia). Could you
> provide a contact in Intel who could facilitate this ?
_I_ can't get updated programming information. Some parts of Intel
can't figure out who their supporters are...
> Some interesting bits from /var/log/messages when the network crashes:
>
> Aug 9 16:32:49 load2 kernel: Command 80 was not immediately accepted,
> 10001 ticks!
No command should take 10000 PCI bus cycles. The chip's Command Unit
(the transmit list) internal firmware seems to have crashed at this point.
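
A message like the one above can come from a bounded polling loop. The
sketch below assumes the driver waits for the chip to clear a command
byte and gives up after a fixed number of polls. The function names, the
callback interface, and the fake chip are all hypothetical, written only
to show the pattern, and are not taken from the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accessor: returns the current command byte of the chip. */
typedef uint8_t (*read_cmd_fn)(void *ctx);

/*
 * Poll until the command byte reads zero (command accepted) or until
 * max_polls reads have been made. Returns the number of busy reads seen
 * before acceptance, or -1 if the command was never accepted.
 */
static long wait_for_cmd_done(read_cmd_fn read_cmd, void *ctx, long max_polls)
{
    assert(max_polls > 0);
    for (long i = 0; i < max_polls; i++) {
        if (read_cmd(ctx) == 0)
            return i;
    }
    return -1;
}

/* Stand-in for hardware so the loop can be exercised without a NIC. */
struct fake_chip {
    uint8_t cmd;        /* pending command byte, e.g. 0x80 */
    long busy_reads;    /* reads that still see cmd; negative = hung */
};

static uint8_t fake_read_cmd(void *ctx)
{
    struct fake_chip *c = ctx;
    if (c->busy_reads < 0)
        return c->cmd;              /* hung command unit never clears */
    if (c->busy_reads > 0) {
        c->busy_reads--;
        return c->cmd;              /* still busy */
    }
    return 0;                       /* command accepted */
}
```

A healthy chip returns a small count; a hung command unit exhausts the
poll budget and returns -1, which is the case the log line is reporting.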
--
Donald Becker becker@scyld.com
Scyld Computing Corporation http://www.scyld.com
410 Severn Ave. Suite 210 Second Generation Beowulf Clusters
Annapolis MD 21403 410-990-9993