[eepro100] Command unit failed to mark command 00000000 as complete -- what does it mean?

Donald Becker becker@scyld.com
Tue, 18 Jul 2000 11:08:56 -0400 (EDT)


On Mon, 17 Jul 2000, Robert C. Paulsen, Jr. wrote:

> Subject: [eepro100] Command unit failed to mark command 00000000 as complete
>     -- what does it mean?
> 
> My /var/log/messages file has a few hundred of the following messages.
> This started about 3 days ago.

What driver version are you using?

> Jul 17 14:46:21 home kernel: eth0: Command unit failed to mark command 00000000 as complete at 78644.

This message indicates that the eepro100 you are using has a bug where it
skipped marking a command as complete.

When this occurs, the chip has corrupted its internal state.  The driver
can reset the chip and keep running, but the same problem will recur
almost immediately, and each recovery is slower than normal operation.
The only full recovery seems to be a hard reset or powering off the
system.
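
To illustrate the kind of check involved, here is a minimal, self-contained
C sketch.  It is not the eepro100.c code: the ring layout, the
STATUS_COMPLETE bit value, and every name in it (cmd_desc, scavenge,
RING_SIZE) are assumptions made up for the example.  The idea is only that
while reclaiming finished commands, an entry that is not marked complete but
is followed by one that is marked complete must have been skipped.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE       8       /* hypothetical command ring length */
#define STATUS_COMPLETE 0x8000  /* assumed "complete" bit in the status word */

/* Hypothetical command descriptor: only the status word matters here. */
struct cmd_desc {
    uint16_t status;
};

/*
 * Reclaim finished commands between *dirty (oldest unreclaimed entry)
 * and cur (next entry to be queued).  Returns how many descriptors were
 * found unmarked even though a later descriptor was already complete.
 */
static int scavenge(struct cmd_desc *ring, unsigned *dirty, unsigned cur)
{
    int skipped = 0;

    assert(ring != NULL && dirty != NULL);

    while (*dirty != cur) {
        unsigned entry = *dirty % RING_SIZE;

        if (!(ring[entry].status & STATUS_COMPLETE)) {
            unsigned next = *dirty + 1;

            /* Nothing later has finished: the command is simply pending. */
            if (next == cur ||
                !(ring[next % RING_SIZE].status & STATUS_COMPLETE))
                break;

            /* A later command finished first, so this one was skipped. */
            printf("descriptor %u skipped, status %08x\n",
                   *dirty, (unsigned)ring[entry].status);
            skipped++;
        }

        ring[entry].status = 0;  /* hand the slot back for reuse */
        (*dirty)++;
    }
    return skipped;
}
```

A caller would run something like this from its Tx cleanup path and treat a
nonzero return as the cue to log the problem and start recovery.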

This bug appears on no errata list that I have seen.  It seems to affect
only a few chip versions, and to be triggered by only some motherboards.

This bug was a nasty problem, and it gave me a bad reputation.  It's the
kind of bug where it would happen to someone, they would make a random
change to the driver, and their updated driver would run reliably for a
week.  They would submit the change as a "bug fix".  When I stated that
their change didn't fix any obvious bug, they would stomp off and call me
names.  After all, they had seen my driver stop repeatedly in the span of a
few minutes, and their driver just ran for a whole week without a problem.
This very situation happened to Linus, and he never admitted that his
changes to eepro100 didn't fix the problem.  He just believed that I had
some other hidden flaw in the driver. 

In v1.09s I added an explicit check for this case.  Here is that change
log entry -- look at entry #7.  At this point I still wasn't certain that
descriptor skipping was A Bug:

________________
date: 1999/09/30 00:55:38;  author: becker;  state: Exp;  lines: +283 -222
eepro100.c v1.09s 9/29/99
Updated to track the "kern-2.3" version.

Added TX_QUEUE_UNFULL, the queue length where we once again accept Tx packets.

Shuffled the kernel version compatibility code around and added local version
of the pci-scan routines.

Added a new PCI device ID  0x1029, reported by Russ Nelson.

Changed clear_suspend() to use a byte write rather than an atomic bit op.

Changed the Tx-timeout check to avoid false triggers.  This included adding
a last_cmd_time variable.

Changed to struct net_device from struct device.

Always write SCBCmd as byte-wide rather than word-wide.

Added explicit descriptor-skipped check when scavenging the command list.

Reset the chip when shutting down the interface, rather than just stopping it,
to disable flow control packets that might be sent.

Changed the ordering of command queue operations to eliminate the window
where sp->cur_tx points to a not-yet-valid command.  We should no longer need
a lock in the interrupt routine, and the locked regions when adding a command
are shorter.  (Note: the locks have not been moved to take advantage of this.)
----------------------------

Donald Becker				becker@scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Beowulf Clusters / Linux Installations
Annapolis MD 21403