Some comments...

Wed Sep 8 12:24:35 1999

Hi,

I have also been following the various discussions concerning the operation
(or inoperation?) of the eepro100, and until now I haven't had much to
contribute. But I am very interested in the experiences of others in getting
this device to work in a high performance environment.

I am using this device in an embedded system where the device is used as a
high speed datalink for a bridging application in some network equipment.

We have found that the Intel 82559 Ethernet device does not conform to
published specifications regarding certain transmit status bits.
Specifically, we have found that the device does not properly update the
completion bit in the Transmit Descriptor (TxCB) status word.

We have implemented our transmit algorithm as described in section 9.2 of the
Intel 82558 Software Developer's Manual (Intel document 687805-003). We have
also conformed to the operation of the transmit command as described in
section 6.2.2.6.2 of the aforementioned manual. We are using a TxCB ring
with 16 entries. The ring is statically allocated. We are using flexible
transmit mode so the SF bit is set to '1' for all TxCB entries. 

The TxCB ring entries are initialized such that the SF, C, and OK bits are
set to '1' and the command (CMD) bits are set to 100 (binary) for the
transmit command. The device Command Unit (CU) is placed into the suspend
state prior to the first frame being transmitted.

We use three TxCB ring pointers to track the active head and tail of the
ring. Outgoing frames are placed by the CPU on the ring tail which is
tracked by the pointer TxCB_ring_tail. The ring head pointer TxCB_ring_head
is used to track the completed frames, i.e. those frames transmitted by the
device and whose completion bit has been set by the device.

The TxCB ring pointers are initialized as follows:
	TxCB_ring_head = TxCB_entry[0]
	TxCB_ring_tail = TxCB_entry[0]
	previous_TxCB_ring_tail = TxCB_entry[15]
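
A minimal C sketch of this initialization, under stated assumptions: the
struct layout and the names txcb, txcb_ring and txcb_ring_init are
hypothetical, and the bit positions for SF and CMD are inferred from the
snapshot values given further down (C = bit 15 and OK = bit 13 are as stated
in this report). The snapshot words also have bit 20 set, which this report
does not describe, so the sketch leaves that bit clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXCB_RING_SIZE 16u

/* Bits of the 32-bit TxCB command/status word (assumed positions). */
#define TXCB_OK      (1u << 13)  /* OK bit */
#define TXCB_C       (1u << 15)  /* completion bit */
#define TXCB_CMD_TX  (4u << 16)  /* CMD = 100 (binary), transmit */
#define TXCB_SF      (1u << 19)  /* flexible transmit mode */

/* Hypothetical, simplified TxCB: only the fields discussed here. */
struct txcb {
    uint32_t cmd_status;
    uint32_t tx_buf_addr;
};

struct txcb_ring {
    struct txcb entry[TXCB_RING_SIZE];  /* fixed-size ring of 16 entries */
    unsigned head;       /* TxCB_ring_head: next entry to scavenge */
    unsigned tail;       /* TxCB_ring_tail: next entry to fill */
    unsigned prev_tail;  /* previous_TxCB_ring_tail */
};

static void txcb_ring_init(struct txcb_ring *r)
{
    assert(r != NULL);
    for (unsigned i = 0; i < TXCB_RING_SIZE; i++) {
        /* SF, C and OK set to '1'; CMD set to transmit. */
        r->entry[i].cmd_status = TXCB_SF | TXCB_C | TXCB_OK | TXCB_CMD_TX;
        r->entry[i].tx_buf_addr = 0;
    }
    r->head = 0;
    r->tail = 0;
    r->prev_tail = TXCB_RING_SIZE - 1u;  /* entry [15] */
}
```

With these assumed bit positions every entry starts out as 0x0ca000, which
matches the idle entries in the snapshot apart from bit 20.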

So, when we begin transmitting, we start using the TxCB ring at entry
TxCB_entry[0], incrementing the TxCB_ring_tail as we transmit each
successive frame. For each transmitted frame, the CPU sets (to '1') the S
bit of the TxCB command word, clears ('0') the completion bit of that entry,
and clears ('0') the S bit of the previous TxCB entry
(previous_TxCB_ring_tail). The CPU then issues a RESUME command to the
device CU. This is in conformance with the transmit algorithm described in
section 9.2 of the Intel 82558 Software Developer's Manual.
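
The per-frame steps above might look roughly like this in C. It is a sketch
only: txcb_enqueue and cu_resume are hypothetical names, cu_resume is a stub
that counts calls instead of touching hardware, and the S bit position
(bit 30) is an assumption inferred from the 0x401ca000 word in the snapshot
below. A real driver would also need whatever memory ordering the platform
requires before issuing the RESUME.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXCB_RING_SIZE 16u
#define TXCB_C  (1u << 15)  /* completion bit (bit 15) */
#define TXCB_S  (1u << 30)  /* S (suspend) bit; position is an assumption */

struct txcb { uint32_t cmd_status; uint32_t tx_buf_addr; };

struct txcb_ring {
    struct txcb entry[TXCB_RING_SIZE];
    unsigned tail;       /* TxCB_ring_tail */
    unsigned prev_tail;  /* previous_TxCB_ring_tail */
};

/* Hypothetical stand-in for writing a RESUME command to the CU;
 * it only counts calls so the sketch can run without hardware. */
static unsigned cu_resume_count;
static void cu_resume(void) { cu_resume_count++; }

static void txcb_enqueue(struct txcb_ring *r, uint32_t buf_addr)
{
    assert(r != NULL);
    assert(r->tail < TXCB_RING_SIZE && r->prev_tail < TXCB_RING_SIZE);

    struct txcb *cur = &r->entry[r->tail];
    struct txcb *prev = &r->entry[r->prev_tail];

    cur->tx_buf_addr = buf_addr;
    cur->cmd_status |= TXCB_S;    /* CU suspends after this entry */
    cur->cmd_status &= ~TXCB_C;   /* device is expected to set C again */
    prev->cmd_status &= ~TXCB_S;  /* let the CU run past the old tail */
    cu_resume();

    /* Advance both pointers, wrapping at the end of the ring. */
    r->prev_tail = r->tail;
    r->tail = (r->tail + 1u) % TXCB_RING_SIZE;
}
```

After each call exactly one entry, the newest, carries the S bit, so the CU
always has a defined place to suspend.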

The following snapshot of the 82559 device Transmit Descriptor Ring (TxCB
Ring) illustrates the claim that the device, in certain cases, does not
properly update the completion bit in the TxCB status word. This snapshot
lists each TxCB ring entry along with the actual TxCB entry address, the
contents of the TxCB command/status word, and the address of the associated
transmit buffer.

TxCB entry [0]: 0x50eaf0, 0x1ca000, 0x53656a
TxCB entry [1]: 0x50eb0c, 0x1ca000, 0x53620a
TxCB entry [2]: 0x50eb28, 0x1ca000, 0x535e6a
TxCB entry [3]: 0x50eb44, 0x1ca000, 0x535aea
TxCB entry [4]: 0x50eb60, 0x1ca000, 0x53574a
TxCB entry [5]: 0x50eb7c, 0x1ca000, 0x5353ea
TxCB entry [6]: 0x50eb98, 0x1ca000, 0x53504a
TxCB entry [7]: 0x50ebb4, 0x1ca000, 0x534cca
TxCB entry [8]: 0x50ebd0, 0x1ca000, 0x53492a
TxCB entry [9]: 0x50ebec, 0x1ca000, 0x5345ca
TxCB entry [10]: 0x50ec08, 0x401ca000, 0x53422a	(previous_TxCB_ring_tail)
TxCB entry [11]: 0x50ec24, 0x1ca000, 0x0		(TxCB_ring_tail)
TxCB entry [12]: 0x50ec40, 0x1ca000, 0x0
TxCB entry [13]: 0x50ec5c, 0x1c0000, 0x536c8a	(TxCB_ring_head)
TxCB entry [14]: 0x50ec78, 0x1ca000, 0x53702a
TxCB entry [15]: 0x50ec94, 0x1ca000, 0x53690a
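
To make the command/status words in the snapshot easier to read, here is a
small C decoder sketch. The names txcb_flags and txcb_decode are
hypothetical. The C (bit 15) and OK (bit 13) positions are those stated in
this report; the S, SF and CMD positions are assumptions inferred from the
snapshot values themselves and should be checked against the manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct txcb_flags {
    bool c;        /* completion bit, bit 15 */
    bool ok;       /* OK bit, bit 13 */
    bool s;        /* S bit, assumed bit 30 */
    bool sf;       /* SF bit, assumed bit 19 */
    unsigned cmd;  /* CMD field, assumed bits 18..16 */
};

/* Split one TxCB command/status word into the fields discussed here. */
static void txcb_decode(uint32_t word, struct txcb_flags *out)
{
    assert(out != NULL);
    out->c   = ((word >> 15) & 1u) != 0;
    out->ok  = ((word >> 13) & 1u) != 0;
    out->s   = ((word >> 30) & 1u) != 0;
    out->sf  = ((word >> 19) & 1u) != 0;
    out->cmd = (word >> 16) & 7u;
}
```

Under these assumptions 0x1ca000 decodes as C and OK set with a flexible-mode
transmit command, 0x401ca000 is the same with S set, and 0x1c0000 is the same
command with both C and OK clear.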

When we subject the device to a sustained stream of 64 byte frames, we
observe that, in some cases, the device does not properly update the
completion bit of a TxCB entry for which it has clearly transmitted the
frame. We observe that the device exhibits this problem within 500 frames
(typically less than 100 frames). We can illustrate this for a certain case
where this error occurred. In this case, the device was offered 64 frames.
The frame size was 64 bytes (not including the CRC). The interframe delay
was 0.02 milliseconds. The line speed was 10 Mb/s. The device actually
transmitted 27 of the 64 frames (confirmed by an Ethernet line monitor). The
last frame actually transmitted by the device was TxCB_entry[10]. The
transmit routine was implemented such that a maximum of 13 frames may be
outstanding on the TxCB ring. That is, the TxCB_ring_head must be less than
13 entries behind the TxCB_ring_tail as it follows the TxCB_ring_tail
pointer around the ring.
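
One way to express the outstanding-frame limit is as a wrapped distance
between the two ring indices. This is a sketch under assumptions: the helper
names are hypothetical, and the report does not say exactly how the original
routine counts entries at the boundary, so the "fewer than 13" test below is
only one possible reading.

```c
#include <assert.h>
#include <stdbool.h>

#define TXCB_RING_SIZE       16u
#define TXCB_MAX_OUTSTANDING 13u

/* Number of entries from head up to (not including) tail, with wrap. */
static unsigned txcb_outstanding(unsigned head, unsigned tail)
{
    assert(head < TXCB_RING_SIZE && tail < TXCB_RING_SIZE);
    return (tail + TXCB_RING_SIZE - head) % TXCB_RING_SIZE;
}

/* True if another frame may be queued under the assumed limit. */
static bool txcb_can_enqueue(unsigned head, unsigned tail)
{
    return txcb_outstanding(head, tail) < TXCB_MAX_OUTSTANDING;
}
```

Adding the ring size before subtracting keeps the arithmetic non-negative
when the tail has wrapped around and is numerically below the head.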

The CPU must "scavenge" the TxCB ring prior to transmitting a frame in order to
recycle those TxCB entries that have been transmitted (and presumably have
the completion bit set by the device). The TxCB_ring_head indicates the TxCB
entry that is next to be scavenged. In this case, the device actually
transmitted 27 frames, so it had traversed the ring fully once and had
wrapped around (as expected). This means that the last frame actually
transmitted was TxCB_entry[10]. The CPU was attempting to reclaim (scavenge)
TxCB_entry[13] prior to attempting to transmit the 28th frame using
TxCB_entry[11]. The snapshot of the TxCB ring illustrates that the
completion bit (and OK bit) of TxCB_entry[13] was not set to '1' as expected
(bits 15 and 13 respectively). The TxCB status word for TxCB_entry[13] was
set to 0x1c0000 instead of 0x1ca000. Since the Ethernet line monitor
confirmed that the buffer data associated with TxCB_entry[13] was actually
transmitted, and that the data buffers for the subsequent TxCB entries were
also transmitted, it seems that the device failed to set the completion bit
for TxCB_entry[13], but then correctly set the completion and OK bits for
the subsequently transmitted TxCB entries.
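
The effect of one missing completion bit on scavenging can be shown with a
short C sketch. The function name txcb_scavenge is hypothetical and the loop
is an assumed form of the scavenging step, not the actual routine: it
advances the head only while the entry at the head has its completion bit
(bit 15) set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TXCB_RING_SIZE 16u
#define TXCB_C (1u << 15)  /* completion bit */

/* Reclaim completed entries in order, starting at *head and stopping at
 * tail or at the first entry whose completion bit is still clear.
 * Returns the number of entries reclaimed. */
static unsigned txcb_scavenge(const uint32_t status[TXCB_RING_SIZE],
                              unsigned *head, unsigned tail)
{
    unsigned reclaimed = 0;

    assert(status != NULL && head != NULL);
    assert(*head < TXCB_RING_SIZE && tail < TXCB_RING_SIZE);

    while (*head != tail && (status[*head] & TXCB_C) != 0) {
        *head = (*head + 1u) % TXCB_RING_SIZE;
        reclaimed++;
    }
    return reclaimed;
}
```

Fed the status words from the snapshot, with the head at entry [13] (status
0x1c0000) and the tail at entry [11], this loop reclaims nothing: every later
entry has its completion bit set, but the head cannot move past the one entry
that does not.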

Note that the device CU remained in the SUSPENDED state when this problem
occurred; it did not enter the IDLE state.

We have observed that we must increase the interframe delay to be greater
than about 5 milliseconds in order to reliably sustain a stream of 64 byte
frames using this device. We have also observed that if we increase the
frame size to be larger than 80 bytes, the device does not exhibit this
problem.

-----Original Message-----
From: yhersch [mailto:yhersch@allot.com]
Sent: Wednesday, September 08, 1999 11:10 AM
To: 'linux-eepro100@beowulf.gsfc.nasa.gov'
Subject: Some comments...

Hi,

I've been following the various discussions concerning the operation (or 
inoperation?) of the eepro100. Until now I haven't had much to contribute. 
However, things got hairy and I had no choice but to figure out what's 
going on. Some observations...

1) My feeling (OK, this isn't an observation) all along has been that the 
Intel chip itself has some basic flaw. It seems to get confused and there 
is no way to recover gracefully. I have no proof, but look at the topics 
discussed in this mailing list (receive hangs, transmit timeouts, etc). On 
second thought, maybe this IS an observation.

2) We (Allot Communications) started experiencing crashes when we upgraded 
to a faster system board. I made an assumption (yes, I know what ass-u-me 
means), at least for this exercise (other possibilities of course exist) 
that the problem was timing based. More specifically, the new system board 
is TOO fast, and the NIC can't keep up. This could be caused by an improper 
board design, which doesn't allow certain signals to stabilize properly 
(quickly enough), or it could be a bug in the NIC itself (see #1 above). 
Another possibility is that the chip just isn't designed to operate in 
high-speed systems, and either certain hardware or software design changes 
or workarounds are necessary. Workarounds make me nervous - they often 
translate into reduced performance.

3) So, I got my hands dirty and started mucking around with the driver. 
Most of my experiments involved various delays and code shuffling in the 
driver's interrupt routine. Yeah, you all read correctly, delays in an 
interrupt routine - If any of my computer science instructors were dead 
today they'd be rolling in their graves. Of interest:
==> The proper delay inserted between reading the interrupt status and 
acking the interrupts (writing back to the same register) keeps the board 
from crashing. The size of the delay is particularly sensitive - if too 
low, the system crashes; if too high, the ISR is overworked. Performance 
results were varied based on different delay values.
Acking the interrupts twice (two sequential writes to the status register) 
also kept the system from crashing, however performance suffered 
significantly.
I was unsuccessful in my attempts at removing the delay by shuffling the 
code around. The system continued to crash. More research and 
experimentation is necessary to find another solution to the delay. In my 
opinion, adding a delay is an evil workaround due to faulty hardware 
behavior and it will negatively affect performance.

4) I discovered some potential problems with the driver itself. The Intel 
User's Guide clearly RECOMMENDS that all accesses to the command and status 
registers be limited to byte-wide access to avoid any side-effects. 
However, the driver uses only word-wide access to these registers. There 
might be nothing more sinister in this than the fact that Intel is 
recommending good programming practice. However, I know what it means when 
my wife RECOMMENDS that I tackle some chores around the house. It might be 
that there is in fact a problem with word-wide access, and the driver needs 
to be rewritten, or seriously massaged.

5) The loop in the wait_for_cmd_done() routine might be too short for very 
fast boards. I changed the loop from 100 to 10000. Is this too high, or too 
low? It seems that this keeps the system more stable, but I don't have any 
positive proof (yet).

6) Intel documentation states clearly that the CU Start and RU Start should 
only be executed when the unit is in either the idle or no resources state. 
This is not always checked. For example, in the ISR, the RxStart command 
(RX_START in older drivers) is issued without first invoking 
wait_for_cmd_done(). It seems to me that unless it's 100% sure that the 
receive unit is idle here, wait_for_cmd_done() should be called. Also as I 
recall, there are one or two other places in the driver where either the 
RxStart or CuStart commands are issued without first invoking 
wait_for_cmd_done().

7) The transmit routine has a somewhat lengthy section of code in which 
interrupts are disabled. It seems to me that perhaps it would be worthwhile 
seeing if there is a way to redesign this area to eliminate (or at least 
shorten the duration of) the interrupts being disabled.

Using version 1.05 of the driver, I was able to come up with a stable 
working version of the driver. This was accomplished by doing the 
following:
- In the speedo_interrupt() routine, I added a delay - udelay(2) - right 
after reading the interrupt status.
- Changed the wait_for_cmd_done() loop to 10000.
- Made sure that wait_for_cmd_done() was invoked every place that the 
RxStart or CuStart commands are issued.
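
For illustration, a bounded polling loop of the kind discussed above could
be sketched in C as follows. This is not the driver's code: the accessor
type, the fake register and the function name wait_for_cmd_done_sketch are
all hypothetical, and the sketch assumes that the command byte reads back as
zero once the device has accepted the previous command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessor: returns the current command byte. */
typedef uint8_t (*read_cmd_fn)(void *ctx);

/* Poll until the command byte clears or max_polls reads have been made.
 * Returns true if the byte cleared in time, false on timeout. */
static bool wait_for_cmd_done_sketch(read_cmd_fn read_cmd, void *ctx,
                                     unsigned max_polls)
{
    assert(read_cmd != NULL);
    while (max_polls-- > 0) {
        if (read_cmd(ctx) == 0)
            return true;
    }
    return false;
}

/* Fake register for experimenting without hardware: it reads as busy
 * for the next busy_reads reads, then as zero. */
struct fake_cmd_reg { unsigned busy_reads; };

static uint8_t fake_read_cmd(void *ctx)
{
    struct fake_cmd_reg *reg = ctx;
    if (reg->busy_reads > 0) {
        reg->busy_reads--;
        return 0x10;  /* arbitrary nonzero "command pending" value */
    }
    return 0;
}
```

Returning a status rather than silently falling through would let a caller
decide not to issue RxStart or CuStart when the wait times out. A fake
device that stays busy for 500 reads times out with a limit of 100 and
succeeds with a limit of 10000.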

I hope that I've contributed some useful ideas and haven't just wasted 
mailing list bandwidth. I'm continuing my experiments and maybe something 
will come of all this. I'll keep you all posted.

Thanks of course goes to Donald Becker. Along with Daniel Veillard, I too 
find it amazing that just about every NIC driver has Donald's name as the 
author. Doesn't the guy ever sleep?!

Regards,

Yisrael (Russ) Hersch
Allot Communications
yhersch@allot.com