3com 3c905c-txm

Donald Becker becker@scyld.com
Sat May 13 00:15:49 2000


On Sat, 13 May 2000, Andrew Morton wrote:
> Donald Becker wrote:
> > 
> > [ big snip.  I'll follow up on linux-vortex ]
> >
> > 4000 PCI cycles is a *very* long time.  Ages.  It should only take 0 to 2
> > PCI cycles to queue a packet.

To clarify: the "0 to 2 PCI cycles" is the count for ideal hardware, not
necessarily what we have to work with now.  The best case is either
   Just queuing the descriptor, knowing that the chip will be looking at the
   descriptor sometime soon.  This is possible if you check that
     1) the previously queued packet has not yet been sent, and
     2) the transmit queue has at least two packets waiting.

   or issuing a single PCI memory write that wakes the Tx unit up.  PCI
   memory writes are very efficient on some systems, since they are queued
   in a write buffer while the CPU continues to work.  I/O space writes
   usually have the semantics that they must complete before the processor
   can do more work.  That might be over a microsecond, or about a thousand
   instructions on a fast machine.

   Having to read, either I/O or memory, is always expensive.
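
The two checks for the "just queue the descriptor" case can be sketched
roughly as follows.  This is a host-side model only, not code from any
driver: the ring layout, the done[] flag (standing in for a completion status
the chip writes back to host memory), and the name tx_needs_wakeup() are all
assumptions made for illustration.  Note that it only reads host memory,
never the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TX_RING_SIZE 16            /* hypothetical ring size */

/* Hypothetical model of a Tx descriptor ring kept in host memory. */
struct tx_ring {
    unsigned int cur_tx;           /* next slot the driver fills (free-running) */
    unsigned int dirty_tx;         /* oldest slot not yet reclaimed (free-running) */
    bool done[TX_RING_SIZE];       /* assumed: set when the chip has sent the slot */
};

/* Count queued packets that the chip has not yet marked as sent. */
static unsigned int tx_pending(const struct tx_ring *r)
{
    unsigned int n = 0;

    for (unsigned int i = r->dirty_tx; i != r->cur_tx; i++)
        if (!r->done[i % TX_RING_SIZE])
            n++;
    return n;
}

/*
 * Decide, before queuing one more packet, whether the Tx unit needs an
 * explicit wake-up write.  The write can be skipped only if
 *   1) the previously queued packet has not yet been sent, and
 *   2) at least two packets are still waiting in the queue.
 */
static bool tx_needs_wakeup(const struct tx_ring *r)
{
    assert(r != NULL);

    if (r->cur_tx == r->dirty_tx)
        return true;               /* nothing queued: chip may be idle */

    bool prev_unsent = !r->done[(r->cur_tx - 1) % TX_RING_SIZE];

    return !(prev_unsent && tx_pending(r) >= 2);
}
```

When tx_needs_wakeup() returns false the cost to the bus is zero cycles; when
it returns true the driver would fall back to the single posted memory write.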

> Alas I seem to have lost the ability to reproduce[*].

You imagined the whole thing.
Quit eating those mushrooms.

I know all about not being able to reproduce problems.  Some versions of the
eepro100 chip have a bug where they switch into "broken mode".  The hardware
and driver will work fine for weeks, then something will go wrong
(presumably with the internal firmware). Despite resetting everything, the
chip will stop again after sending just a few packets.

The problem for me is when someone encounters this, makes a driver change,
and their modified driver works for a week without a problem.  They proclaim
that their new driver is much more reliable, and that they have fixed The
Bug.

Usually they haven't fixed anything, or have even introduced bugs, but to them
all evidence points to a successful fix.  When I say "that's not a fix",
they bypass me and submit a patch to Linus.  Linus, not knowing the whole
story, puts the patch in.  After all, here is someone Doing Something about
The Problem, as opposed to Donald, who is trying to keep everything a
secret over on the mailing lists.  (I'm trying to minimize what he has to
deal with, and trying to minimize change points in the mainline kernel.)

The bottom line is that for well-established code you should establish
what the actual bug is.  That means being able to reproduce it at will, and
having a good explanation of how it is occurring.  Ideally you should
measure or directly demonstrate what is happening.

There are things that mask bugs, but don't fix them.  Putting in locks, or
randomly reordering the code, frequently has this effect.  Locks, especially,
slow the code down and can reduce the symptom frequency without removing the
true problem.

> of three Linux boxes and one NT, the best I can get is 240 loops, in the
> DownStall in boomerang_start_xmit().  3c905B.   Still much higher than
> we expect.

Quick test: histogram of the loop count.
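
Something along these lines would do for collecting the histogram.  It is a
user-space sketch with power-of-two buckets; the names (struct loop_hist,
loop_hist_record(), and so on) are made up for illustration and are not part
of any driver.  In a driver the record call would sit right after the stall
loop and the print would use the kernel's logging instead of printf().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOOP_HIST_BUCKETS 16       /* hypothetical number of buckets */

/*
 * Bucket 0 counts zero loops; bucket k counts 2^(k-1) .. 2^k - 1 loops;
 * the last bucket also catches everything larger.
 */
struct loop_hist {
    unsigned long bucket[LOOP_HIST_BUCKETS];
};

/* Map a loop count to its bucket index. */
static int loop_hist_index(unsigned int loops)
{
    int idx = 0;

    while (loops != 0 && idx < LOOP_HIST_BUCKETS - 1) {
        loops >>= 1;
        idx++;
    }
    return idx;
}

/* Record one observation of the stall-loop count. */
static void loop_hist_record(struct loop_hist *h, unsigned int loops)
{
    assert(h != NULL);
    h->bucket[loop_hist_index(loops)]++;
}

/* Print only the non-empty buckets, one per line. */
static void loop_hist_print(const struct loop_hist *h)
{
    for (int i = 0; i < LOOP_HIST_BUCKETS; i++) {
        if (h->bucket[i] == 0)
            continue;
        if (i == 0)
            printf("%6u       : %lu\n", 0u, h->bucket[i]);
        else if (i == LOOP_HIST_BUCKETS - 1)
            printf("%6u+      : %lu\n", 1u << (i - 1), h->bucket[i]);
        else
            printf("%6u-%-6u: %lu\n", 1u << (i - 1), (1u << i) - 1,
                   h->bucket[i]);
    }
}
```

A distribution clustered near zero with a few large outliers points to a
different cause than one that sits uniformly in the hundreds, which is why
the shape is more useful than the single worst-case number.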

Donald Becker				becker@scyld.com
Scyld Computing Corporation
410 Severn Ave. Suite 210
Annapolis MD 21403

