DnStall period
Andrew Morton
andrewm@uow.edu.au
Fri May 12 23:36:54 2000
It is definitely associated with collisions.
My test setup of several weeks ago showed this nicely. I had four
machines on a hubbed 10bT LAN:
Machine A: cs89x0
Machine B: 3c575
Machine C: 3c905B
Machine D: ne2k
pingflood A from B
pingflood B from A
pingflood D from C
machine C is the test machine.
With this setup I was seeing the wait_for_completion() function fall
through after 2,000 loops several times a minute. I was also frequently
seeing the maxCollisions threshold exceeded.
I no longer have machine B, and my current setup doesn't reproduce the
problem as well.
Still, I have seen the wait_for_completion count hit several hundred
quite often. It is always when called from the DownStall in
boomerang_start_xmit(). A few times I have even seen the 4,000 count
exceeded.
For some reason, using a ping packet size of 24 on the test machine
('ping -s 24') exacerbates the problem.
I have attached to this email my debug version of
wait_for_completion(). I am using the 2.3.99-pre7 driver.
http://www.uow.edu.au/~andrewm/linux/3c59x.c-2.3.99-pre7-1-1.gz (I think
this is unchanged...)
Now, let's look at the output:
Here is a typical one: 106 loops
wfc(1743): 106
status=0xe000, txfree=0x0008, downlistptr=0x078902e0, dnmaxburst=0
cur_tx=498879, dirty_tx=498869, tx_full=0
vp->stats.tx_carrier_errors=2
vp->stats.tx_heartbeat_errors=0
vp->stats.collisions=46782
vp->stats.tx_window_errors=0
vp->stats.rx_fifo_errors=0
vp->stats.tx_packets=496692
vp->stats.tx_packets=496692
Here is a rare one: the timeout was exceeded (4,000 loops):
eth0: command 0x3002 did not complete! Status=0xf000
wfc(1743): 4001
status=0xf000, txfree=0x0008, downlistptr=0x07890220, dnmaxburst=0
cur_tx=495358, dirty_tx=495346, tx_full=0
vp->stats.tx_carrier_errors=2
vp->stats.tx_heartbeat_errors=0
vp->stats.collisions=46300
vp->stats.tx_window_errors=0
vp->stats.rx_fifo_errors=0
vp->stats.tx_packets=492926
vp->stats.tx_packets=492926
In all cases, txfree has a very small value: 0x8, 0xc, etc. This means
that the Tx FIFO is almost full.
So, my theory:
- The NIC has started to transmit a packet.
- The next packet in memory is, say, 64 bytes.
- The NIC sees >32 bytes spare in the FIFO (but <64).
- The NIC transfers 32 bytes from main memory.
- Collisions start happening, and force the NIC to
resend the current packet an arbitrary number of times.
- During this process, we issue a DnStall.
The NIC simply has nowhere to go. It can't honour the DnStall because
it's halfway through processing a DPD. It can't free up room in the
FIFO because it has to hang onto the head packet for retransmission.
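The theory above can be sketched as a toy model. This is not driver code and not a description of the real hardware: the struct, the function names, the 2048-byte FIFO size and the fixed 32-byte burst are all assumptions made up for illustration. It only shows why, under the theory, neither the download nor the DnStall can make progress while the head packet keeps colliding.

```c
#include <stdbool.h>

#define BURST_BYTES 32  /* assumed download burst size, taken from the theory */

/* Hypothetical model of the Tx side; every field is an assumption. */
struct tx_model {
    int  fifo_size;        /* total Tx FIFO size in bytes */
    int  pinned_bytes;     /* FIFO bytes held until the head packet is sent */
    int  next_pkt_total;   /* size of the packet described by the current DPD */
    int  next_pkt_loaded;  /* bytes of that packet already downloaded */
    bool colliding;        /* head packet keeps colliding on the wire */
};

/* Free space left in the modelled FIFO. */
static int tx_free(const struct tx_model *m)
{
    return m->fifo_size - m->pinned_bytes - m->next_pkt_loaded;
}

/* Advance the model by one step. Returns true if anything progressed. */
static bool model_step(struct tx_model *m)
{
    /* The head packet can only leave the FIFO once it stops colliding. */
    if (!m->colliding && m->pinned_bytes > 0) {
        m->pinned_bytes = 0;
        return true;
    }
    /* The download only moves when a whole burst fits in the free space. */
    if (m->next_pkt_loaded < m->next_pkt_total && tx_free(m) >= BURST_BYTES) {
        m->next_pkt_loaded += BURST_BYTES;
        return true;
    }
    return false;
}

/* In this model a DnStall is only honoured on a DPD boundary. */
static bool can_honour_dnstall(const struct tx_model *m)
{
    return m->next_pkt_loaded == 0 ||
           m->next_pkt_loaded >= m->next_pkt_total;
}
```

Starting with 40 bytes free and a 64-byte packet pending, one step downloads 32 bytes and leaves 8 free. While colliding stays true, model_step() then returns false forever and can_honour_dnstall() stays false. Clearing colliding lets the head packet drain, the download finish and the stall complete.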
I would like to be able to query the NIC's current internal DMA address
pointer. Can't see a way of doing this.
I would like to know what DMA burst sizes the NIC is using. Can't see
any reference to this. Is this a PCI thing?
Now, as Bogdan points out, recovery from this situation is tricky. If
we simply let the loop counter expire and proceed as if the command has
completed we get into race conditions writing DPDs in main memory - the
current download will eventually complete and the NIC can scribble on
values which the CPU is writing to memory.
I _think_ the best thing to do is to simply keep spinning. If we hit
some ridiculously large timeout then presumably the collision
state is permanent and there is some broken equipment on the LAN. Doing
a panic() here is a bit rude - we need to do a global reset and back
out. Callers of wait_for_completion() need to be taught that it can
fail and need to back out gracefully. Messy.
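A minimal userspace sketch of what "teaching callers that it can fail" might look like. Nothing here is from the driver: wait_for_completion_checked(), start_xmit_sketch(), the fake_nic type and the WFC_MAX_SPINS limit are hypothetical names and values, and the status read is a stand-in for inw(ioaddr + EL3_STATUS). The in-progress bit is assumed to be bit 12, which is consistent with the 0xe000 and 0xf000 status values in the output above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define CMD_IN_PROGRESS 0x1000u   /* assumed: bit 12 of the status word */
#define WFC_MAX_SPINS   1000000L  /* hypothetical "ridiculously large" limit */

/* Stand-in for reading the status register of a real NIC. */
typedef uint16_t (*read_status_fn)(void *ctx);

/*
 * Spin until the in-progress bit clears.
 * Returns 0 on completion or -ETIMEDOUT once the spin limit is reached.
 * If spins_out is non-NULL it receives the number of busy reads seen.
 */
static int wait_for_completion_checked(read_status_fn read_status, void *ctx,
                                       long *spins_out)
{
    long spins;
    int ret = -ETIMEDOUT;

    for (spins = 0; spins < WFC_MAX_SPINS; spins++) {
        if (!(read_status(ctx) & CMD_IN_PROGRESS)) {
            ret = 0;
            break;
        }
    }
    if (spins_out)
        *spins_out = spins;
    return ret;
}

/* Fake NIC for illustration: busy for a fixed number of reads, then idle. */
struct fake_nic {
    long busy_reads;
};

static uint16_t fake_read_status(void *ctx)
{
    struct fake_nic *nic = ctx;

    if (nic->busy_reads > 0) {
        nic->busy_reads--;
        return 0xf000;  /* in-progress bit set */
    }
    return 0xe000;      /* in-progress bit clear */
}

/*
 * Hypothetical caller: it backs out without touching the descriptor ring
 * when the stall never completes. A real driver would presumably also
 * schedule a global reset at this point.
 */
static int start_xmit_sketch(struct fake_nic *nic, int *ring_touched)
{
    int err = wait_for_completion_checked(fake_read_status, nic, NULL);

    if (err)
        return err;
    *ring_touched = 1;  /* only now is it safe to write DPDs */
    return 0;
}
```

The point of the shape is that the DPD writes sit strictly after a successful return, so a timed-out stall can never race with a download that is still in flight.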
Bogdan, I am at a loss to explain why increasing the loop count from
2,000 to 4,000 changed anything for you. You're on switched 100bT,
right? You shouldn't be getting _any_ collisions (and when in full
duplex mode the NIC doesn't even look for collisions). So what's going
on?
I suspect that you're mistaken and that upping the loop counter was not
the source of your success.
Can you please drop my debug wait_for_completion() into your driver and
let us know what happens? Thanks.
--
-akpm-
Attachment: wfc
/* Debug wrapper: records the line number of the caller. */
#define wait_for_completion(dev, cmd) _wait_for_completion(dev, cmd, __LINE__)

static void _wait_for_completion(struct net_device *dev, int cmd, int _line)
{
	int i = 0;
	long ioaddr = dev->base_addr;
	struct vortex_private *vp = (struct vortex_private *)dev->priv;

	outw(cmd, dev->base_addr + EL3_CMD);
	while (i++ < 4000)
	{
		if (!(inw(ioaddr + EL3_STATUS) & CmdInProgress))
		{
			/* Completed, but report it if it took more than 10 loops. */
			if (i > 10)
				goto whoops;
			return;
		}
	}
	printk(KERN_ERR "%s: command 0x%04x did not complete! Status=0x%x\n",
		dev->name, cmd, inw(dev->base_addr + EL3_STATUS));

whoops:
	{
		int old_window;
		unsigned short status;
		unsigned short txfree;
		unsigned long downlistptr;
		int dnmaxburst;

		/* Remember the current register window so it can be restored. */
		old_window = inw(ioaddr + EL3_CMD) >> 13;
		status = inw(dev->base_addr + EL3_STATUS);
		EL3WINDOW(3);
		txfree = inw(ioaddr + 12);
		downlistptr = inl(ioaddr + DownListPtr);
		dnmaxburst = (inw(ioaddr + 0x78) >> 5) & 0x3f;
		outw(0, ioaddr + 0x78);
		printk("wfc(%d): %d\n", _line, i);
		printk("status=0x%04x, txfree=0x%04x, downlistptr=0x%08lx, dnmaxburst=%d\n",
			status, txfree, downlistptr, dnmaxburst);
		printk("cur_tx=%d, dirty_tx=%d, tx_full=%d\n",
			vp->cur_tx, vp->dirty_tx, vp->tx_full);
		EL3WINDOW(6);
		/* Print each statistic as "name=value". */
#define P(s) printk(#s "=%ld\n", s);
		P(vp->stats.tx_carrier_errors);
		P(vp->stats.tx_heartbeat_errors);
		P(vp->stats.collisions);
		P(vp->stats.tx_window_errors);
		P(vp->stats.rx_fifo_errors);
		P(vp->stats.tx_packets);
		P(vp->stats.tx_packets);
#undef P
		EL3WINDOW(old_window);
	}
}