[eepro100] Command unit failed to mark command 00000000 as complete -- what does it mean?

Robert C. Paulsen, Jr. paulsen@texas.net
Tue, 18 Jul 2000 19:50:18 -0500


Donald, 

Thanks for the reply.

The version of the driver (from the source) is: eepro100.c:v1.09r2 10/15/99.
This is from a SuSE 6.4 distribution. The card itself has the following 
markings on the chip:

	582557
	L7233192
	SL24Z
	(c) 1989 1995

I have swapped out the eepro100 for a RealTek RTL8139 and am now using your
driver: rtl8139.c:v1.08 6/25/99. So far, so good! (And your reputation is
just fine with me!)

Donald Becker wrote:
> 
> On Mon, 17 Jul 2000, Robert C. Paulsen, Jr. wrote:
> 
> > Subject: [eepro100] Command unit failed to mark command 00000000 as complete
> >     -- what does it mean?
> >
> > My /var/log/messages file has a few hundred of the following messages.
> > This started about 3 days ago.
> 
> What driver version are you using?
> 
> > Jul 17 14:46:21 home kernel: eth0: Command unit failed to mark command 00000000 as complete at 78644.
> 
> This message indicates that the eepro100 you are using has a bug where it
> skipped marking a command as complete.
> 
> When this occurs it means that the chip has corrupted its internal state.
> The driver can reset the chip, but the same problem will recur almost
> immediately.  The driver recovers from this problem, but the recovery is
> slower than normal operation. The only full recovery seems to be a hard
> reset or powering off the system.
> 
> This bug appears on no errata list that I have seen.  It seems to affect
> only a few chip versions, and to be triggered by only some motherboards.
> 
> This bug was a nasty problem, and it gave me a bad reputation.  It's the
> kind of bug where it would happen to someone, they would make a random
> change to the driver, and their updated driver would run reliably for a
> week.  They would submit the change as a "bug fix".  When I stated that
> their change didn't fix any obvious bug, they would stomp off and call me
> names.  After all, they had seen my driver stop repeatedly in the span of a
> few minutes, and their driver just ran for a whole week without a problem.
> This very situation happened to Linus, and he never admitted that his
> changes to eepro100 didn't fix the problem.  He just believed that I had
> some other hidden flaw in the driver.
> 
> In v1.09s I added an explicit check for this case.  Here is that change
> log entry -- look at entry #7.  At this point I still wasn't certain that
> descriptor skipping was A Bug:
> 
> ________________
> date: 1999/09/30 00:55:38;  author: becker;  state: Exp;  lines: +283 -222
> eepro100.c v1.09s 9/29/99
> Updated to track the "kern-2.3" version.
> 
> Added TX_QUEUE_UNFULL, the queue length where we once again accept Tx packets.
> 
> Shuffled the kernel version compatibility code around and added local version
> of the pci-scan routines.
> 
> Added a new PCI device ID  0x1029, reported by Russ Nelson.
> 
> Changed clear_suspend() to use a byte write rather than an atomic bit op.
> 
> Changed the Tx-timeout check to avoid false triggers.  This included adding
> a last_cmd_time variable.
> 
> Changed to struct net_device from struct device.
> 
> Always write SCBCmd as byte-wide rather than word-wide.
> 
> Added explicit descriptor-skipped check when scavenging the command list.
> 
> Reset the chip when shutting down the interface, rather than just stopping it,
> to disable flow control packets that might be sent.
> 
> Changed the ordering of command queue operations to eliminate the window
> where sp->cur_tx points to a not-yet-valid command.  We should no longer need
> a lock in the interrupt routine, and the locked regions when adding a command
> are shorter.  (Note: the locks have not been moved to take advantage of this.)
> ----------------------------
> 
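
For anyone reading this in the archive, the "explicit descriptor-skipped
check when scavenging the command list" mentioned in the change log can be
pictured roughly as follows. This is only a minimal sketch in C and is not
taken from eepro100.c: the ring layout, RING_SIZE, STATUS_COMPLETE, the
struct names and scavenge() are all hypothetical, and the skip rule used
here (an incomplete entry followed by an already completed one) is an
assumption about how such a check might work.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE       8       /* hypothetical ring length (power of two) */
#define STATUS_COMPLETE 0x8000  /* hypothetical "command complete" bit */

struct cmd_desc {
    uint16_t status;            /* written by the chip when a command finishes */
};

struct cmd_ring {
    struct cmd_desc ring[RING_SIZE];
    unsigned int dirty;         /* oldest command not yet reclaimed */
    unsigned int cur;           /* next slot the driver will fill */
};

/*
 * Reclaim finished commands between dirty and cur.
 * Returns the number of entries reclaimed and stores in *skipped how many
 * of them were never marked complete even though a later command was.
 */
static unsigned int scavenge(struct cmd_ring *r, unsigned int *skipped)
{
    unsigned int reclaimed = 0;

    assert(r != NULL && skipped != NULL);
    *skipped = 0;

    while (r->dirty != r->cur) {
        unsigned int entry = r->dirty % RING_SIZE;

        if (!(r->ring[entry].status & STATUS_COMPLETE)) {
            unsigned int next = r->dirty + 1;

            /* Nothing later has completed: the command is simply pending. */
            if (next == r->cur ||
                !(r->ring[next % RING_SIZE].status & STATUS_COMPLETE))
                break;

            /* A later command finished first, so this one was skipped. */
            printf("Command unit failed to mark command %08x as complete at %u.\n",
                   (unsigned int)r->ring[entry].status, r->dirty);
            (*skipped)++;
        }

        r->ring[entry].status = 0;  /* slot is free for reuse */
        r->dirty++;
        reclaimed++;
    }
    return reclaimed;
}
```

The point of the sketch is that without the look-ahead the reclaim loop
would stop forever at the first unmarked entry, which would look like a
stalled transmitter. With it, the driver can log the event and keep going,
which would be consistent with the recovery being slower than normal
operation.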

-- 
____________________________________________________________________
Robert Paulsen                                     paulsen@texas.net