[eepro100] Short technical discussion about "No (RX) resources"

Mon Jun 17 05:12:01 2002

Hello EEPRO100 hackers, hello Donald Becker,

I want to discuss the reasons for the error messages
	No resources	
	No RX resources

I do not want to talk about driver versions here, or about
whether the driver was downloaded from SCYLD or comes from
the Linux kernel.

To discuss the problem on high-speed machines, I
use two versions as examples: the Linux-2.2.18 driver and
the current SCYLD version of eepro100.c.

I ported the Linux-2.2.18 driver to our company's own
operating system (rail station and rail control),
and we found the following situation:

On machines up to PIII 500 (or slightly more) there are no
problems. On PIII 700 machines the two error messages are
seen sometimes; on faster machines (> 900MHz) the error
message is always shown and the NIC does not work.

This effect is contrary to what a driver writer would
expect: on a faster machine the RX buffers are more likely
to be refilled in time, so "No (RX) resources" should never
occur. On slow machines the probability of a lack of
resources should be higher.

On our system the number of skb buffers is fixed at 600
for 3 interfaces. Each interface has 64 RX buffers reserved;
the STREAMS driver on top of it can hold at most 128
skbs, then it starts to drop.

The test scenario includes only ONE interface, so this one
interface can work with 472 skbs, of which 64 are reserved
for its own RX. There is definitely no resource problem
getting skbs!
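
The skb accounting above can be written out as a small C sketch.
The constants are the figures quoted in this mail; the function
names are made up for illustration, and the "worst case" assumes
that the two idle interfaces hold only their RX reservation.

```c
enum {
    TOTAL_SKBS      = 600, /* fixed skb pool shared by all interfaces */
    NUM_INTERFACES  = 3,
    RX_RING_SIZE    = 64,  /* RX buffers reserved per interface */
    STREAMS_MAX_SKB = 128  /* STREAMS driver starts dropping above this */
};

/* skbs usable by the single active interface, assuming the idle
 * interfaces keep only their own RX reservation. */
static int skbs_for_active_interface(void)
{
    return TOTAL_SKBS - (NUM_INTERFACES - 1) * RX_RING_SIZE;
}

/* skbs still free when the RX ring is fully stocked and the
 * STREAMS driver holds its maximum. */
static int skbs_spare_worst_case(void)
{
    return skbs_for_active_interface() - RX_RING_SIZE - STREAMS_MAX_SKB;
}
```

Even with the RX ring full and STREAMS at its limit, the pool
still has spare skbs, which is why skb allocation is unlikely to
be the cause here.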

As seen on the mailing list, you have often suggested increasing
the number of resources, but I think this is not the
problem here.

Does the system lose interrupts on fast machines (so packets
are not freed and the RX buffers are not refilled), or were
changes made in the code to avoid this effect?
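
To illustrate the lost-interrupt hypothesis, here is a minimal
sketch of an RX ring in C. It is a simplified model, not the
driver code: the struct, the function names and the `irq_every'
parameter are assumptions for illustration, and the real NIC
stops its receive unit on an empty ring rather than counting
misses.

```c
#include <stdbool.h>

#define RX_RING_SIZE 64

struct rx_ring {
    unsigned int cur_rx;   /* next slot the NIC will fill */
    unsigned int dirty_rx; /* next slot the driver has to refill */
};

/* Number of empty buffers the NIC can still receive into. */
static unsigned int rx_free(const struct rx_ring *r)
{
    return RX_RING_SIZE - (r->cur_rx - r->dirty_rx);
}

/* NIC side: consume one buffer; false models "No RX resources". */
static bool nic_receive(struct rx_ring *r)
{
    if (rx_free(r) == 0)
        return false;
    r->cur_rx++;
    return true;
}

/* Driver side, normally run from the interrupt handler:
 * hand all used buffers back to the NIC. */
static void driver_refill(struct rx_ring *r)
{
    r->dirty_rx = r->cur_rx;
}

/* Feed `packets' frames and service the ring every `irq_every'
 * frames (0 = never, i.e. all interrupts lost). Returns how many
 * frames found the ring empty. */
static unsigned int simulate(unsigned int packets, unsigned int irq_every)
{
    struct rx_ring r = {0, 0};
    unsigned int no_resources = 0;

    for (unsigned int i = 1; i <= packets; i++) {
        if (!nic_receive(&r))
            no_resources++;
        if (irq_every != 0 && i % irq_every == 0)
            driver_refill(&r);
    }
    return no_resources;
}
```

In this model the ring runs dry after exactly 64 frames once the
refill stops, independent of how many skbs the system still has
in its pool.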

Let's discuss some points in the code that show up as
diffs between the Linux 2.2.18 driver and the currently
downloadable SCYLD version. I only point out diffs that are
relevant to the problem described above, IMO:

* In `wait_for_cmd_done()':
	This routine works in a different way.
	What was the rationale for changing this ?

* Introduction of `do_slow_command()'

* Changed code in `speedo_resume()' (this is only relevant
	for recovering the NIC after a TIMEOUT situation, 
	is this correct ?).
	What was the rationale for changing this ?

* Changed triggering in `speedo_start_xmit()':
	In older Linux code: The order is
		wait_for_cmd_done()
		clear_suspend()
		do CUResume
		spin_unlock_irqrestore()
	In current scyld code:
		clear_suspend()
		flow control for TX queues of linux kernel
		spin_unlock_irqrestore()
		wait_for_cmd_done()
		do CUResume
	What was the rationale for changing this ?

* Additional handling of RXSuspend Interrupt
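
The two orderings in `speedo_start_xmit()' listed above can be
compared with a small trace model in C. This is only a sketch of
the step order as given in this mail, not the driver code: the
trace structure and the function names are hypothetical, and the
spinlock is assumed to be held when the sequence starts.

```c
#include <stdbool.h>
#include <stddef.h>

enum step { WAIT_CMD_DONE, CLEAR_SUSPEND, CU_RESUME, FLOW_CONTROL, UNLOCK };

struct trace_entry {
    enum step step;
    bool lock_held; /* was the driver spinlock held at this step? */
};

struct trace {
    struct trace_entry e[8];
    size_t n;
    bool lock_held;
};

static void trace_begin(struct trace *t)
{
    t->n = 0;
    t->lock_held = true; /* assumption: lock taken earlier in start_xmit */
}

static void record(struct trace *t, enum step s)
{
    t->e[t->n].step = s;
    t->e[t->n].lock_held = t->lock_held;
    t->n++;
    if (s == UNLOCK)
        t->lock_held = false; /* models spin_unlock_irqrestore() */
}

/* Order as listed for the older Linux code. */
static void xmit_tail_linux_2_2_18(struct trace *t)
{
    trace_begin(t);
    record(t, WAIT_CMD_DONE);
    record(t, CLEAR_SUSPEND);
    record(t, CU_RESUME);
    record(t, UNLOCK);
}

/* Order as listed for the current scyld code. */
static void xmit_tail_scyld(struct trace *t)
{
    trace_begin(t);
    record(t, CLEAR_SUSPEND);
    record(t, FLOW_CONTROL);
    record(t, UNLOCK);
    record(t, WAIT_CMD_DONE);
    record(t, CU_RESUME);
}

/* True if step `s' was recorded while the lock was held. */
static bool step_under_lock(const struct trace *t, enum step s)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->e[i].step == s)
            return t->e[i].lock_held;
    return false;
}
```

The visible difference is that the older order issues CUResume
with the lock held and interrupts disabled, while the scyld order
issues it after the unlock. Whether this matters for the RX
problem is exactly the open question.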

Further keywords can be discussed:

* Use of different FIFO threshold(s) 
* max interrupt work (200 in Linux-2.2.18 and 20 in SCYLD)
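
To show what the max interrupt work limit can mean in practice,
here is a minimal sketch in C of a handler loop with a work
budget. Treating each loop pass as one handled event is a
simplification, and the function name and signature are made up
for illustration.

```c
/* Handle pending events until none are left or the per-interrupt
 * budget is used up. Returns the number of events handled and
 * stores the number still pending in *left. */
static unsigned int service_interrupt(unsigned int pending,
                                      unsigned int max_interrupt_work,
                                      unsigned int *left)
{
    unsigned int handled = 0;

    while (pending > 0 && handled < max_interrupt_work) {
        pending--;
        handled++;
    }
    *left = pending;
    return handled;
}
```

With a full ring of 64 pending events, a budget of 200 clears
everything in one interrupt, while a budget of 20 leaves 44
events for a later interrupt. If that later interrupt is delayed
or lost, the RX refill is delayed as well.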

I hope you are interested in such a short discussion
from the technical point of view, pointing out which code
is critical for the problem on fast machines.

BTW: I have an NDA with Intel (through the company) and have
the documentation for the EEPRO 100, so you can also point to
chapters or refs.

Christoph Plattner

------------------------------------------------------------------
private:  christoph.plattner@gmx.at
company:  christoph.plattner@alcatel.at