Some comments...
yhersch
yhersch@allot.com
Wed Sep 8 09:04:13 1999
Hi,
I've been following the various discussions concerning the operation (or
inoperation?) of the eepro100. Until now I haven't had much to contribute.
However, things got hairy and I had no choice but to figure out what's
going on. Some observations...
1) My feeling (OK, this isn't an observation) all along has been that the
Intel chip itself has some basic flaw. It seems to get confused and there
is no way to recover gracefully. I have no proof, but look at the topics
discussed in this mailing list (receive hangs, transmit timeouts, etc). On
second thought, maybe this IS an observation.
2) We (Allot Communications) started experiencing crashes when we upgraded
to a faster system board. I made an assumption (yes, I know what ass-u-me
means), at least for this exercise (other possibilities of course exist)
that the problem was timing based. More specifically, the new system board
is TOO fast, and the NIC can't keep up. This could be caused by an improper
board design, which doesn't allow certain signals to stabilize properly
(quickly enough), or it could be a bug in the NIC itself (see #1 above).
Another possibility is that the chip just isn't designed to operate in
high-speed systems, and either certain hardware or software design changes
or workarounds are necessary. Workarounds make me nervous - they often
translate into reduced performance.
3) So, I got my hands dirty and started mucking around with the driver.
Most of my experiments involved various delays and code shuffling in the
driver's interrupt routine. Yeah, you all read correctly, delays in an
interrupt routine - if any of my computer science instructors were dead
today they'd be rolling in their graves. Of interest:
==> The proper delay inserted between reading the interrupt status and
acking the interrupts (writing back to the same register) keeps the board
from crashing. The driver is particularly sensitive to the size of the
delay - if too short, the system crashes; if too long, the ISR is
overworked. Performance varied with the delay value.
Acking the interrupts twice (two sequential writes to the status register)
also kept the system from crashing, but performance suffered
significantly.
I was unsuccessful in my attempts at removing the delay by shuffling the
code around. The system continued to crash. More research and
experimentation are necessary to find an alternative to the delay. In my
opinion, adding a delay is an evil workaround for faulty hardware
behavior, and it will negatively affect performance.
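To make the read / delay / ack ordering from item 3 concrete, here is a
minimal user-space sketch. It is not the driver code: sim_inw(), sim_outw()
and sim_udelay() are hypothetical stand-ins for the kernel's inw(), outw()
and udelay(), the register is simulated as write-1-to-clear, and the 0xfc00
event mask is an assumption about which status bits are interrupt events.

```c
#include <stdint.h>

/* Hypothetical stand-ins for inw()/outw()/udelay(); a real driver would
 * use the kernel versions on the actual SCB status register. */
static uint16_t fake_scb_status;   /* simulated SCB status register */
static unsigned delay_calls;       /* counts simulated delays */

static uint16_t sim_inw(void) { return fake_scb_status; }

/* Simulated write-1-to-clear behaviour: bits written as 1 are cleared. */
static void sim_outw(uint16_t v) { fake_scb_status &= (uint16_t)~v; }

static void sim_udelay(unsigned usec) { (void)usec; delay_calls++; }

#define ACK_DELAY_USEC 2       /* value used in the post; board-specific */
#define INTR_EVENT_MASK 0xfc00 /* assumed: event bits live in the top bits */

/* Read pending events, pause, then acknowledge exactly those events.
 * Returns the event bits that were acknowledged. */
static uint16_t read_and_ack(void)
{
    uint16_t status = sim_inw();
    sim_udelay(ACK_DELAY_USEC);            /* the workaround delay */
    sim_outw(status & INTR_EVENT_MASK);    /* ack only the event bits */
    return status & INTR_EVENT_MASK;
}
```

With the simulated register preloaded with both event and state bits,
read_and_ack() clears only the event bits and leaves the rest alone. The
right delay value has to be found by experiment on the real board.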
4) I discovered some potential problems with the driver itself. The Intel
User's Guide clearly RECOMMENDS that all accesses to the command and status
registers be limited to byte-wide access to avoid any side-effects.
However, the driver uses only word-wide access to these registers. There
might be nothing more sinister in this than the fact that Intel is
recommending good programming practice. However, I know what it means when
my wife RECOMMENDS that I tackle some chores around the house. It might be
that there is in fact a problem with word-wide access, and the driver needs
to be rewritten, or seriously massaged.
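For illustration, byte-wide access as described in item 4 could look
something like the sketch below. This is a user-space simulation, not
driver code. The byte offsets are my assumption of the usual 8255x SCB
layout (status word at offset 0 with the STAT/ACK byte on top, command word
at offset 2 with the interrupt mask byte on top), and sim_inb()/sim_outb()
are hypothetical stand-ins for inb()/outb().

```c
#include <stdint.h>

/* Assumed byte offsets within the SCB; check against the Intel manual. */
enum { SCB_STATUS_LO = 0, SCB_STAT_ACK = 1, SCB_CMD = 2, SCB_INT_MASK = 3 };

static uint8_t scb[4];   /* simulated 4-byte SCB window */

static uint8_t sim_inb(int off) { return scb[off]; }

static void sim_outb(uint8_t v, int off)
{
    if (off == SCB_STAT_ACK)
        scb[off] &= (uint8_t)~v;   /* simulated write-1-to-clear ack byte */
    else
        scb[off] = v;
}

/* Acknowledge pending events by touching only the STAT/ACK byte. */
static uint8_t ack_events_bytewide(void)
{
    uint8_t events = sim_inb(SCB_STAT_ACK);
    sim_outb(events, SCB_STAT_ACK);
    return events;
}

/* Issue a command by writing only the command byte, so the interrupt
 * mask byte next to it is never rewritten as a side effect. */
static void issue_cmd_bytewide(uint8_t cmd)
{
    sim_outb(cmd, SCB_CMD);
}
```

The point of the sketch is only that each operation touches a single byte:
an ack never rewrites the CU/RU state bits, and a command never rewrites
the interrupt mask. Whether word-wide access actually causes trouble on
the real chip is still an open question.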
5) The loop in the wait_for_cmd_done() routine might be too short for very
fast boards. I changed the loop count from 100 to 10000. Is this too high,
or too
low? It seems that this keeps the system more stable, but I don't have any
positive proof (yet).
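One way to get some evidence on the loop count in item 5 would be to have
the wait report how many polls it actually needed, and whether it gave up.
The following is a user-space sketch under that assumption, not the
driver's wait_for_cmd_done(). sim_read_cmd() and busy_reads_left are
hypothetical and only simulate a command byte that reads back non-zero for
a while.

```c
#include <stdint.h>

/* Hypothetical knob: how many reads the simulated chip stays busy for. */
static unsigned busy_reads_left;

/* Simulated read of the command byte: non-zero while still busy. */
static uint8_t sim_read_cmd(void)
{
    if (busy_reads_left > 0) {
        busy_reads_left--;
        return 0x01;
    }
    return 0x00;
}

/* Poll until the command byte reads zero or max_polls reads were made.
 * Returns the number of polls used, or -1 if the limit was reached. */
static int wait_for_cmd_done_bounded(int max_polls)
{
    for (int polls = 1; polls <= max_polls; polls++) {
        if (sim_read_cmd() == 0)
            return polls;
    }
    return -1;
}
```

If the simulated chip stays busy for 500 reads, a limit of 100 returns -1
while a limit of 10000 returns 501. Logging the returned count on a real
board would show how close 100 or 10000 comes to the limit in practice.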
6) Intel documentation states clearly that the CU Start and RU Start should
only be executed when the unit is in either the idle or no resources state.
This is not always checked. For example, in the ISR, the RxStart command
(RX_START in older drivers) is issued without first invoking
wait_for_cmd_done(). It seems to me that unless it's 100% certain that the
receive unit is idle here, wait_for_cmd_done() should be called. Also as I
recall, there are one or two other places in the driver where either the
RxStart or CuStart commands are issued without first invoking
wait_for_cmd_done().
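A simple way to enforce the rule from item 6 would be to route every
command through one helper that waits first and only then writes. The
sketch below simulates this in user space. All names (issue_cmd_checked(),
sim_read_cmd(), sim_write_cmd(), CMD_RX_START and its value) are
hypothetical. Note that it models only the command-byte handshake, and
does not check the receive unit's state itself.

```c
#include <stdbool.h>
#include <stdint.h>

#define CMD_RX_START 0x01          /* hypothetical opcode for the simulation */

static uint8_t  sim_cmd_reg;       /* simulated command byte */
static unsigned sim_busy_reads;    /* reads until the chip accepts a command */
static unsigned dropped_cmds;      /* writes that clobbered a pending command */

/* Simulated read: the pending command clears once the busy count runs out. */
static uint8_t sim_read_cmd(void)
{
    if (sim_busy_reads > 0)
        sim_busy_reads--;
    else
        sim_cmd_reg = 0;
    return sim_cmd_reg;
}

/* Simulated write: records when a still-pending command gets overwritten. */
static void sim_write_cmd(uint8_t cmd)
{
    if (sim_cmd_reg != 0)
        dropped_cmds++;
    sim_cmd_reg = cmd;
}

/* Wait for the previous command to be accepted, then issue the new one.
 * Returns false (and writes nothing) if the wait runs out. */
static bool issue_cmd_checked(uint8_t cmd, int max_polls)
{
    while (max_polls-- > 0) {
        if (sim_read_cmd() == 0) {
            sim_write_cmd(cmd);
            return true;
        }
    }
    return false;
}
```

With a helper like this, the ISR would call
issue_cmd_checked(CMD_RX_START, 10000) instead of writing the command
directly, and a false return flags a chip that never accepted the previous
command.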
7) The transmit routine has a somewhat lengthy section of code in which
interrupts are disabled. It seems worthwhile to see whether this area can
be redesigned to eliminate (or at least shorten) the period during which
interrupts are disabled.
Using version 1.05 of the driver, I was able to come up with a stable
working version of the driver. This was accomplished by doing the
following:
- In the speedo_interrupt() routine, I added a delay - udelay(2) - right
after reading the interrupt status.
- Changed the wait_for_cmd_done() loop to 10000.
- Made sure that wait_for_cmd_done() was invoked every place that the
RxStart or CuStart commands are issued.
I hope that I've contributed some useful ideas and haven't just wasted
mailing list bandwidth. I'm continuing my experiments and maybe something
will come of all this. I'll keep you all posted.
Thanks of course go to Donald Becker. Along with Daniel Veillard, I too
find it amazing that just about every NIC driver has Donald's name as the
author. Doesn't the guy ever sleep?!
Regards,
Yisrael (Russ) Hersch
Allot Communications
yhersch@allot.com