[eepro100] Transmitter Timeout -- addednum

Donald Becker becker@scyld.com
Tue, 1 Aug 2000 02:47:52 -0400 (EDT)


On Sun, 30 Jul 2000, Kallol Biswas wrote:

> I don't know about the latest eepro100 driver, but the version 
> I saw had a fundamental design problem, again I will try explain:

The current version is at
    ftp://www.scyld.com/pub/network/eepro100.c
The test version is at
    ftp://www.scyld.com/pub/network/test/eepro100.c

>    82559 prefetches the next command from the command ring,
> suppose the cmd unit is executing ith command and has 
> prefetched the next one, i.e. (i+1)th already, driver 
> sets up the (i+1)th cmd, sets the S bit and sends RESUME,
> if the CU:
> 	*in Suspended state it goes to active state, does not re-read next
> link pointer (address for i+1th), re-reads the S bit of the ith command.
> If the Sbit of ith command is cleared then executes the i+1th otherwise
> goes back to suspended state.

Correct.  The link pointer in the command with the suspend bit is read only
once.  The chip reads only the command line on subsequent polls, and only
pays attention to the suspend bit.

>          *If CU is active it checks the validity of S bits of next(i+1 th)
> and present(ith) cmd(PCI cmd 0x6 MR is used to re-read Sbit of a TxCB, I saw
> it on analyzer).
> Please note that it does not say it re-analyzes the next(i+1 th) command but
> the S bit.
>
> So if the i+1 th command was a previously executed say transmit cmd and 
> driver sets up now as a say multicast cmd then the card executes
> i+1 th cmd with invalid parameters, and the card stalls.

Hmmm, the documentation does state that the chip will examine the 'S' bit
of the i+1 command, but this is only an optimization and should not impact
correct operation.  If the suspend bit is set on the current i'th command
(as it will be if the subsequent command is still being constructed), the
chip will suspend.  It will then re-read the i+1'th command at the next
RESUME command.

In general, the operation is
    The CU first reads the current "i"th command, using a burst read.
      It stores the link pointer for later use.
      It examines the S bit.
    Only when the S bit is clear does the chip interpret the subsequent
      command. 
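
To make the sequence above concrete, here is a toy model of that walk in
plain C.  It is only a sketch of the behaviour as described in this message,
not chip documentation and not code from eepro100.c; every name in it
(toy_cmd, toy_cu, cu_start, cu_run, cu_resume) is made up for illustration.
The point it shows is that the link is latched once, and a RESUME only
re-checks the S bit of the current command.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names are hypothetical, not from the driver. */
enum cu_state { CU_SUSPENDED, CU_ACTIVE };

struct toy_cmd {
    bool suspend;   /* the 'S' bit */
    int  link;      /* ring index of the next command */
    int  executed;  /* times the model has "run" this command */
};

struct toy_cu {
    enum cu_state state;
    int cur;        /* index of the current, i'th command */
    int saved_link; /* link pointer, read once and kept */
};

/* Read and run the first command, latching its link pointer. */
static void cu_start(struct toy_cu *cu, struct toy_cmd *ring, int first)
{
    cu->cur = first;
    cu->saved_link = ring[first].link;
    ring[first].executed++;
    cu->state = CU_ACTIVE;
}

/* Walk the ring until a set S bit suspends the CU.  The ring must
 * contain at least one command with S set, or this never returns. */
static void cu_run(struct toy_cu *cu, struct toy_cmd *ring)
{
    while (cu->state == CU_ACTIVE) {
        if (ring[cu->cur].suspend) {
            /* S still set: do not look at the next command at all. */
            cu->state = CU_SUSPENDED;
            break;
        }
        /* S clear: follow the link that was stored earlier. */
        cu->cur = cu->saved_link;
        cu->saved_link = ring[cu->cur].link;
        ring[cu->cur].executed++;
    }
}

/* RESUME: go active and re-check only the current command's S bit. */
static void cu_resume(struct toy_cu *cu, struct toy_cmd *ring)
{
    cu->state = CU_ACTIVE;
    cu_run(cu, ring);
}
```

In this model a RESUME issued while the i'th S bit is still set drops
straight back to suspended and never touches command i+1, which is the
property the argument above relies on.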

> Our initial version of the 82559 driver would hang on an Itanium processor
> based system because of this problem, but adding a NOP after a
> cmd has solved the problem. Now our stress tests run for days without
> any problem on 82559. 

Are you certain that you are not seeing the CU_RESUME command by-pass the
next descriptor initialization that is still sitting in a write buffer?
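
The ordering I have in mind looks roughly like the sketch below.  It is
user-space C11, so the atomics here only stand in for the idea; they order
CPU stores and say nothing about what a PCI device sees.  In a kernel driver
the equivalent would be the kernel's own barrier primitives (e.g. wmb()).
The struct layout and the names toy_txcb, toy_cu_resume and queue_cmd are
assumptions for illustration, not the driver's real definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor; not the real TxCB layout. */
struct toy_txcb {
    _Atomic bool suspend;  /* the 'S' bit */
    uint16_t command;
    uint32_t param;
};

static int resume_count;   /* stand-in for the CU_RESUME port write */

static void toy_cu_resume(void)
{
    resume_count++;
}

/* Queue `next` behind `prev` in the order the chip needs to see it. */
static void queue_cmd(struct toy_txcb *prev, struct toy_txcb *next,
                      uint16_t command, uint32_t param)
{
    /* 1. Build the new command completely, with its own S bit set. */
    next->command = command;
    next->param = param;
    atomic_store_explicit(&next->suspend, true, memory_order_relaxed);

    /* 2. Only then clear S on the previous command.  The release
     *    store keeps the writes in step 1 from being reordered
     *    after it. */
    atomic_store_explicit(&prev->suspend, false, memory_order_release);

    /* 3. Drain the stores before kicking the chip, so the resume
     *    cannot overtake a descriptor still in a write buffer. */
    atomic_thread_fence(memory_order_seq_cst);
    toy_cu_resume();
}
```

If the resume in step 3 can pass the stores in steps 1 and 2, the chip may
wake up, find the old S bit still set or a half-written descriptor, and a
NOP padding command would hide the race rather than remove it.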

> > v1.06 of the driver seemed to handle the TX timeouts quicker than
> > v1.09, but in v1.09 they were less frequent.  I tried to compile v1.10
> > and experimental v1.11, but I got all types of compile errors and did
> > not have the motivation to port them to v2.2.16 of the kernel after all
> > my above failures.

It's not difficult: just read
   http://www.scyld.com/network/updates.html

I converted my drivers to pci-scan and kern_compat.h at the request of Linus
for no backward-compatible code in the driver.  It has turned out to be a
big support problem -- the previous method of everything in a single *.c
file is much easier for users.

> > I have NO IDEA what is causing these TX timeouts. . . if any of the
> > gurus here would be as kind as to aid me in my efforts to figure this
> > out, I would greatly appreciate it!  I will grant accounts on the
> > troublesome machine if that will aid in trouble-shooting, and I will
> > code whatever I can if anyone can give me a direction to go in. . . 

Please try v1.10 or v1.11.  It should fix the problem.

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Beowulf Clusters / Linux Installations
Annapolis MD 21403