Some comments...

Scott Tyson tysons@deepwell.com
Wed Jan 12 18:39:16 2000


I have worked with these NICs in various forms under NT for quite some
time and have never seen any errors or problems there. I have never
seen any problems under Linux either, but my server is a Quake2/Quake3
server with some FTP/HTTP, and I'm on a half-duplex 10BaseT link as
well. I'm assuming that either my bandwidth hasn't reached a critical
level or it is full duplex that gives the NIC fits. My machine is a dual
PII400 (Gigabyte MB) running Red Hat 6.0, kernel 2.2.13 and ver 1.06 of
the NIC driver.
My guess is that Intel has 1) worked around issues with the chipset in
their Windows drivers to hide design problems and/or 2) not released
complete specs on the board, and this is causing problems.
If I knew what I was doing when it came to C or network drivers I'd
create a driver that followed Intel's specs 100% and then work from
that (not that Donald has not done this). That way you eliminate any
deviations from the specs as the culprit. Just my $0.02.
If anyone has a way for me to test to see if I can crap out my NIC I'd
be willing to do that and feed the results back to the list.


Scott


----- Original Message -----
From: "yhersch" <yhersch@allot.com>
To: <linux-eepro100@beowulf.gsfc.nasa.gov>
Sent: Wednesday, September 08, 1999 7:09 AM
Subject: Some comments...


> Hi,
>
> I've been following the various discussions concerning the operation
> (or inoperation?) of the eepro100. Until now I haven't had much to
> contribute. However, things got hairy and I had no choice but to figure
> out what's going on. Some observations...
>
> 1) My feeling (OK, this isn't an observation) all along has been that
> the Intel chip itself has some basic flaw. It seems to get confused and
> there is no way to recover gracefully. I have no proof, but look at the
> topics discussed in this mailing list (receive hangs, transmit
> timeouts, etc). On second thought, maybe this IS an observation.
>
> 2) We (Allot Communications) started experiencing crashes when we
> upgraded to a faster system board. I made an assumption (yes, I know
> what ass-u-me means), at least for this exercise (other possibilities
> of course exist) that the problem was timing based. More specifically,
> the new system board is TOO fast, and the NIC can't keep up. This could
> be caused by an improper board design, which doesn't allow certain
> signals to stabilize properly (quickly enough), or it could be a bug in
> the NIC itself (see #1 above). Another possibility is that the chip
> just isn't designed to operate in high-speed systems, and either
> certain hardware or software design changes or workarounds are
> necessary. Workarounds make me nervous - they often translate into
> reduced performance.
>
> 3) So, I got my hands dirty and started mucking around with the driver.
> Most of my experiments involved various delays and code shuffling in
> the driver's interrupt routine. Yeah, you all read correctly, delays in
> an interrupt routine - If any of my computer science instructors were
> dead today they'd be rolling in their graves. Of interest:
> ==> The proper delay inserted between reading the interrupt status and
> acking the interrupts (writing back to the same register) keeps the
> board from crashing. The size of the delay is particularly sensitive -
> if too low, the system crashes; if too high, the ISR is overworked.
> Performance results were varied based on different delay values.
> Acking the interrupts twice (two sequential writes to the status
> register) also kept the system from crashing, however performance
> suffered significantly.
> I was unsuccessful in my attempts at removing the delay by shuffling
> the code around. The system continued to crash. More research and
> experimentation is necessary to find another solution to the delay. In
> my opinion, adding a delay is an evil workaround due to faulty hardware
> behavior and it will negatively affect performance.
>
> 4) I discovered some potential problems with the driver itself. The
> Intel User's Guide clearly RECOMMENDS that all accesses to the command
> and status registers be limited to byte-wide access to avoid any
> side-effects. However, the driver uses only word-wide access to these
> registers. There might be nothing more sinister in this than the fact
> that Intel is recommending good programming practice. However, I know
> what it means when my wife RECOMMENDS that I tackle some chores around
> the house. It might be that there is in fact a problem with word-wide
> access, and the driver needs to be rewritten, or seriously massaged.
>
> 5) The loop in the wait_for_cmd_done() routine might be too short for
> very fast boards. I changed the loop count from 100 to 10000. Is this
> too high, or too low? It seems that this keeps the system more stable,
> but I don't have any positive proof (yet).
>
> 6) Intel documentation states clearly that the CU Start and RU Start
> should only be executed when the unit is in either the idle or no
> resources state. This is not always checked. For example, in the ISR,
> the RxStart command (RX_START in older drivers) is issued without first
> invoking wait_for_cmd_done(). It seems to me that unless it's 100% sure
> that the receive unit is idle here, wait_for_cmd_done() should be
> called. Also as I recall, there are one or two other places in the
> driver where either the RxStart or CuStart commands are issued without
> first invoking wait_for_cmd_done().
>
> 7) The transmit routine has a somewhat lengthy section of code in which
> interrupts are disabled. It seems to me that perhaps it would be
> worthwhile seeing if there is a way to redesign this area to eliminate
> (or at least shorten the duration of) the interrupts being disabled.
>
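Point 4 above (byte-wide versus word-wide register access) can be
illustrated with a small user-space mock. This is only a sketch of one
kind of side effect a word-wide write can have, not a claim about what
Intel's guide had in mind. The register layout (command in the low byte,
an interrupt mask in the high byte), the struct, and every function name
here are assumptions made up for the example; none of it is taken from
the eepro100 driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical image of one 16-bit register pair.
 * Assumed layout: bytes[0] = command byte, bytes[1] = interrupt mask. */
struct mock_scb {
    uint8_t bytes[2];
};

/* Word-wide write: both bytes are replaced in a single access. */
static void mock_outw(struct mock_scb *scb, uint16_t value)
{
    scb->bytes[0] = (uint8_t)(value & 0xff);
    scb->bytes[1] = (uint8_t)(value >> 8);
}

/* Byte-wide write: only the addressed byte changes. */
static void mock_outb(struct mock_scb *scb, unsigned offset, uint8_t value)
{
    assert(offset < 2);
    scb->bytes[offset] = value;
}

/* Issue a command with a word-wide access. The high byte of the value
 * is zero, so whatever was in the mask byte is overwritten. */
static void issue_cmd_wordwide(struct mock_scb *scb, uint8_t cmd)
{
    mock_outw(scb, cmd);
}

/* Issue the same command with a byte-wide access. The mask byte is
 * never touched. */
static void issue_cmd_bytewide(struct mock_scb *scb, uint8_t cmd)
{
    mock_outb(scb, 0, cmd);
}
```

With the mask byte preset to a nonzero value, issue_cmd_wordwide()
leaves it at zero while issue_cmd_bytewide() leaves it unchanged. On
real hardware the consequences depend on what the neighbouring byte
controls, which is exactly what would need checking against the Intel
documentation.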
>
> Using version 1.05 of the driver, I was able to come up with a stable
> working version of the driver. This was accomplished by doing the
> following:
> - In the speedo_interrupt() routine, I added a delay - udelay(2) - right
> after reading the interrupt status.
> - Changed the wait_for_cmd_done() loop to 10000.
> - Made sure that wait_for_cmd_done() was invoked every place that the
> RxStart or CuStart commands are issued.
>
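The three changes listed above can be sketched against a mock device in
user space. This is a rough illustration under stated assumptions, not
the actual patch: struct mock_nic, mock_read_cmd(), mock_udelay(),
issue_cmd(), mock_interrupt() and CMD_WAIT_POLLS are hypothetical names,
the mock clears its command byte after a fixed number of polls, and
wait_for_cmd_done() returns a success flag here purely so the effect of
the loop bound is visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CMD_WAIT_POLLS 10000  /* loop bound from the message (was 100) */

/* Hypothetical stand-in for the NIC; real code would use port I/O. */
struct mock_nic {
    uint8_t  cmd;          /* nonzero while a command is still pending */
    int      polls_needed; /* reads left before the pending command clears */
    int      busy_polls;   /* how long the mock stays busy per command */
    uint16_t status;       /* pending interrupt status bits */
    int      delays_us;    /* total delay requested, in microseconds */
};

/* Each read moves the mock one step closer to accepting the command. */
static uint8_t mock_read_cmd(struct mock_nic *nic)
{
    if (nic->cmd != 0) {
        if (nic->polls_needed > 0)
            nic->polls_needed--;
        if (nic->polls_needed == 0)
            nic->cmd = 0;
    }
    return nic->cmd;
}

/* Records the requested delay instead of actually sleeping. */
static void mock_udelay(struct mock_nic *nic, int usecs)
{
    nic->delays_us += usecs;
}

/* Poll until the command byte reads zero or the bound runs out. */
static bool wait_for_cmd_done(struct mock_nic *nic, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (mock_read_cmd(nic) == 0)
            return true;
    }
    return false;
}

/* Always wait for the previous command before issuing a new one. */
static bool issue_cmd(struct mock_nic *nic, uint8_t cmd)
{
    if (!wait_for_cmd_done(nic, CMD_WAIT_POLLS))
        return false;
    nic->cmd = cmd;
    nic->polls_needed = nic->busy_polls;
    return true;
}

/* Read the status, pause briefly, then ack the bits that were read. */
static uint16_t mock_interrupt(struct mock_nic *nic)
{
    uint16_t status = nic->status;
    mock_udelay(nic, 2);              /* the udelay(2) from the message */
    nic->status &= (uint16_t)~status; /* ack: clear the handled bits */
    return status;
}
```

With a mock that needs 500 polls to clear, a bound of 100 gives up while
a bound of 10000 succeeds. The mock says nothing about why the real chip
needs the delay; it only shows where each of the three changes sits.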
> I hope that I've contributed some useful ideas and haven't just wasted
> mailing list bandwidth. I'm continuing my experiments and maybe
> something will come of all this. I'll keep you all posted.
>
> Thanks of course go to Donald Becker. Along with Daniel Veillard, I too
> find it amazing that just about every NIC driver has Donald's name as
> the author. Doesn't the guy ever sleep?!
>
> Regards,
>
> Yisrael (Russ) Hersch
> Allot Communications
> yhersch@allot.com
>
>
