Some comments...

yhersch yhersch@allot.com
Wed Sep 8 09:04:13 1999


Hi,

I've been following the various discussions concerning the operation (or 
inoperation?) of the eepro100. Until now I haven't had much to contribute. 
However, things got hairy and I had no choice but to figure out what's 
going on. Some observations...

1) My feeling (OK, this isn't an observation) all along has been that the 
Intel chip itself has some basic flaw. It seems to get confused and there 
is no way to recover gracefully. I have no proof, but look at the topics 
discussed in this mailing list (receive hangs, transmit timeouts, etc). On 
second thought, maybe this IS an observation.

2) We (Allot Communications) started experiencing crashes when we upgraded 
to a faster system board. I made an assumption (yes, I know what ass-u-me 
means), at least for this exercise (other possibilities of course exist) 
that the problem was timing based. More specifically, the new system board 
is TOO fast, and the NIC can't keep up. This could be caused by an improper 
board design, which doesn't allow certain signals to stabilize properly 
(quickly enough), or it could be a bug in the NIC itself (see #1 above). 
Another possibility is that the chip just isn't designed to operate in 
high-speed systems, and either certain hardware or software design changes 
or workarounds are necessary. Workarounds make me nervous - they often 
translate into reduced performance.

3) So, I got my hands dirty and started mucking around with the driver. 
Most of my experiments involved various delays and code shuffling in the 
driver's interrupt routine. Yeah, you all read correctly, delays in an 
interrupt routine - If any of my computer science instructors were dead 
today they'd be rolling in their graves. Of interest:
==> A suitable delay inserted between reading the interrupt status and 
acking the interrupts (writing back to the same register) keeps the board 
from crashing. The length of the delay is critical: if it is too short, 
the system crashes; if it is too long, the ISR is overworked. Performance 
varied with the delay value.
==> Acking the interrupts twice (two sequential writes to the status 
register) also kept the system from crashing, but performance suffered 
significantly.
==> I was unsuccessful in my attempts to remove the delay by shuffling the 
code around; the system continued to crash. More research and 
experimentation are necessary to find an alternative to the delay. In my 
opinion, adding a delay is an evil workaround for faulty hardware 
behavior, and it will negatively affect performance.
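
The read, delay, ack ordering described in item 3 can be sketched in user 
space as follows. This is only an illustration, not the driver's code: 
read_status(), write_status() and settle_delay() are hypothetical stand-ins 
for the driver's port I/O and udelay(), the write-1-to-clear behaviour is a 
simplified model of the status register, and ACK_MASK (0xfc00) is an 
assumed value for the interrupt bits.

```c
#include <stdint.h>

/* Assumed mask of the interrupt bits in the status word (illustrative). */
#define ACK_MASK 0xfc00u

/* Fake status register and a counter, so the ordering can be checked
 * without hardware. */
uint16_t fake_status_reg = 0;
unsigned delay_calls = 0;

/* Hypothetical stand-in for reading the status register. */
uint16_t read_status(void)
{
    return fake_status_reg;
}

/* Hypothetical stand-in for the ack write; modelled as write-1-to-clear. */
void write_status(uint16_t v)
{
    fake_status_reg &= (uint16_t)~v;
}

/* Hypothetical stand-in for udelay(); only counts calls here. */
void settle_delay(unsigned usec)
{
    (void)usec;
    delay_calls++;
}

/* Read the status, pause, then ack only the interrupt bits.
 * Returns the bits that were acknowledged. */
uint16_t read_and_ack(unsigned delay_usec)
{
    uint16_t status = read_status();
    uint16_t ack = (uint16_t)(status & ACK_MASK);

    settle_delay(delay_usec);   /* the delay under discussion */
    write_status(ack);
    return ack;
}
```

In a real ISR the delay argument would be the value being tuned, for 
example the 2 microseconds mentioned at the end of this message.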

4) I discovered some potential problems with the driver itself. The Intel 
User's Guide clearly RECOMMENDS that all accesses to the command and status 
registers be limited to byte-wide access to avoid any side-effects. 
However, the driver uses only word-wide access to these registers. There 
might be nothing more sinister in this than the fact that Intel is 
recommending good programming practice. However, I know what it means when 
my wife RECOMMENDS that I tackle some chores around the house. It might be 
that there is in fact a problem with word-wide access, and the driver needs 
to be rewritten, or seriously massaged.
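
To make the byte-wide versus word-wide distinction in item 4 concrete, here 
is a small user-space model. It is a sketch under assumptions: the register 
offsets (command byte at offset 2, interrupt mask byte at offset 3) and the 
helper names are illustrative, and the clobbered mask byte is just one 
conceivable side-effect of a word-wide write, not a confirmed cause of the 
crashes.

```c
#include <stdint.h>

/* Assumed byte offsets within the control block (illustrative). */
enum { SCB_STATUS = 0, SCB_STAT_ACK = 1, SCB_CMD = 2, SCB_INT_MASK = 3 };

/* Fake register block standing in for the NIC's I/O space. */
uint8_t fake_scb[4];

/* Hypothetical byte-wide write: touches exactly one register byte. */
void reg_write8(unsigned off, uint8_t v)
{
    fake_scb[off] = v;
}

/* Hypothetical word-wide write: touches two adjacent bytes (little-endian). */
void reg_write16(unsigned off, uint16_t v)
{
    fake_scb[off]     = (uint8_t)(v & 0xff);
    fake_scb[off + 1] = (uint8_t)(v >> 8);
}

/* Word-wide command write: also overwrites the neighbouring mask byte. */
void issue_cmd_word(uint8_t cmd)
{
    reg_write16(SCB_CMD, cmd);
}

/* Byte-wide command write: leaves the mask byte alone. */
void issue_cmd_byte(uint8_t cmd)
{
    reg_write8(SCB_CMD, cmd);
}
```

With a mask value already stored, issue_cmd_byte() preserves it while 
issue_cmd_word() resets it to zero.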

5) The loop in the wait_for_cmd_done() routine might be too short for very 
fast boards. I changed the loop count from 100 to 10000. Is this too high, 
or too low? It seems to keep the system more stable, but I don't have any 
positive proof (yet).
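
The effect of the loop count in item 5 can be sketched with a bounded poll 
against a fake command register. This is not the driver's routine: 
read_cmd(), polls_until_clear and wait_for_cmd_done_bounded() are 
hypothetical names, and the fake register simply clears itself after a set 
number of reads to mimic a chip that needs more polls on a faster CPU.

```c
#include <stdint.h>

/* Fake command register: non-zero means "previous command still pending". */
uint8_t fake_cmd_reg;

/* Number of reads after which the fake register clears itself. */
int polls_until_clear;

/* Hypothetical stand-in for reading the command register. */
uint8_t read_cmd(void)
{
    if (polls_until_clear > 0 && --polls_until_clear == 0)
        fake_cmd_reg = 0;
    return fake_cmd_reg;
}

/* Poll until the command register reads zero or the budget runs out.
 * Returns 0 on success, -1 on timeout. */
int wait_for_cmd_done_bounded(int max_polls)
{
    while (max_polls-- > 0) {
        if (read_cmd() == 0)
            return 0;
    }
    return -1;
}
```

Unlike a loop that silently falls through, this version reports the 
timeout, so a caller can at least log that the budget was too small.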

6) Intel documentation states clearly that the CU Start and RU Start should 
only be executed when the unit is in either the idle or no resources state. 
This is not always checked. For example, in the ISR, the RxStart command 
(RX_START in older drivers) is issued without first invoking 
wait_for_cmd_done(). It seems to me that unless it's 100% sure that the 
receive unit is idle here, wait_for_cmd_done() should be called. Also as I 
recall, there are one or two other places in the driver where either the 
RxStart or CuStart commands are issued without first invoking 
wait_for_cmd_done().
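
The precondition in item 6 (start the receive unit only from the idle or no 
resources state) could be tested explicitly along these lines. Note that 
this is a different check from wait_for_cmd_done(), and that the state 
encodings and the position of the state field in the status byte are 
assumptions made for illustration; they should be verified against the 
Intel documentation before use.

```c
#include <stdint.h>

/* Assumed receive-unit state encodings (illustrative). */
enum ru_state {
    RU_IDLE         = 0x0,
    RU_SUSPENDED    = 0x1,
    RU_NO_RESOURCES = 0x2,
    RU_READY        = 0x4
};

/* Extract the receive-unit state, assumed to sit in bits 5..2 of the
 * low status byte. */
unsigned ru_state_of(uint8_t scb_status)
{
    return (scb_status >> 2) & 0x0f;
}

/* Returns 1 if a receive-unit start is permitted in this state, else 0. */
int ru_start_allowed(uint8_t scb_status)
{
    unsigned s = ru_state_of(scb_status);

    return s == RU_IDLE || s == RU_NO_RESOURCES;
}
```

A caller would read the status byte, call ru_start_allowed(), and only then 
issue the start command.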

7) The transmit routine has a somewhat lengthy section of code in which 
interrupts are disabled. It seems to me that it would be worthwhile to see 
whether this area can be redesigned to eliminate, or at least shorten, the 
period during which interrupts are disabled.


Starting from version 1.05 of the driver, I was able to come up with a 
stable working version by making the following changes:
- In the speedo_interrupt() routine, added a delay - udelay(2) - right 
after reading the interrupt status.
- Changed the wait_for_cmd_done() loop count to 10000.
- Made sure that wait_for_cmd_done() is invoked every place that the 
RxStart or CuStart commands are issued.

I hope that I've contributed some useful ideas and haven't just wasted 
mailing list bandwidth. I'm continuing my experiments and maybe something 
will come of all this. I'll keep you all posted.

Thanks of course go to Donald Becker. Along with Daniel Veillard, I too 
find it amazing that just about every NIC driver has Donald's name as the 
author. Doesn't the guy ever sleep?!

Regards,

Yisrael (Russ) Hersch
Allot Communications
yhersch@allot.com