FIX: 0.99L and timeouts
Andrew Morton
andrewm@uow.edu.au
Thu Apr 20 11:31:28 2000
Bogdan Costescu wrote:
>
> On Thu, 20 Apr 2000, Andrew Morton wrote:
>
> > It's not an interrupt race. spin_lock_irqsave() disables interrupts on
> > the local CPU and also grabs the spinlock, so the local CPU can't take
> > an interrupt and any other CPU will spin on the lock on entry to the ISR
> > until the local CPU releases the lock.
> ....
> > TxIntrUploaded bit. This interrupt will be pending, but the local CPU
> > won't actually take it until it hits the spin_unlock_irqrestore() which
> > reenables local interrupts. (Another CPU may take it earlier and spin
> > on the spinlock in the ISR though).
>
> IMHO, the problem is that if other CPU takes the interrupt, it computes
> entry and prev_entry based on vp->cur_tx which are _before_ the spinlock.
Why is this a problem? The ISR doesn't change the value of cur_tx.
cur_tx is only altered within the spinlock, where the current CPU has
complete control.
> > netperf is useful: www.netperf.org. It simply measures one-way TCP
> > traffic.
>
> I used ttcp which has similar capabilities. I haven't observed any major
> difference in results obtained with ttcp between 0.99L, 3Com's and your
> driver. However, using 3Com's driver produces worse results than 0.99L for
> our parallel codes, while 0.99L and your driver are very close.
> One reason for using a parallel job for testing is that the CPU is loaded
> along with the network, which might help in triggering SMP races.
>
> Ooooo.. bad news! It seems that with your driver (and DownUnstall moved) I
> can get from time to time frozen systems. I delayed this message to be
> sure that I'm able to reproduce it and I got another computer frozen
> (which denies the possibility of a sudden hardware problem). This happens
> only under load of a parallel job and happens only from time to time: I
> was able to run several times our short parallel test (about 10 minutes),
> but afterwards it froze. I should add that I left over night a flooding
> ping working on 2 pairs of computers and I got all 4 happily chewing
> packets this morning...
Oh dear. Perhaps set 'debug=1'?
Also suggest you put a big printk() in vortex_rx() - it should never be
called, and we're _technically_ still in violation of the specs:
Page 122, bit [4]:
"This bit is automatically acknowledged by the upload
engine as it uploads packets. Drivers should disable this
interrupt and mask this bit when reading IntStatus."
We don't mask it - we still test it, although we are disabling it in the
interrupt enable reg.
In fact, you could just remove the lines:
	if (status & RxComplete)
		vortex_rx(dev);
from vortex_interrupt.
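
For illustration, the masking the spec asks for could be sketched roughly as
below. This is a standalone userspace model, not driver code: the SK_* names
and sk_masked_status() are hypothetical, and the bit values are assumptions
(only RxComplete as bit 4 is backed by the spec reference above).

```c
/* Hypothetical status bits for the sketch. SK_RxComplete is bit 4, matching
 * the "bit [4]" reference quoted from the spec; the other values are
 * assumptions chosen for illustration. */
enum sk_status_bits {
	SK_TxAvailable  = 0x0008,
	SK_RxComplete   = 0x0010,
	SK_DownComplete = 0x0200,
	SK_UpComplete   = 0x0400
};

/* Hypothetical helper: drop every bit of the raw status word that is not
 * in the set of interrupts the driver has enabled, so that a disabled
 * source such as RxComplete can never be acted on by mistake. */
static unsigned short sk_masked_status(unsigned short raw,
				       unsigned short enabled)
{
	return (unsigned short)(raw & enabled);
}
```

With only the upload/download bits enabled, a raw status that has RxComplete
set comes back with that bit cleared, so a later test of the RxComplete bit
is simply false.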
And finally, in vortex_interrupt:
	if (status & TxAvailable) {
		if (vortex_debug > 5)
			printk(KERN_DEBUG " TX room bit was handled.\n");
		/* There's room in the FIFO for a full-sized packet. */
		outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
		clear_bit(0, (void*)&dev->tbusy);
		mark_bh(NET_BH);
	}
Put a big printk in here. The test should never be true. I actually
split the ISR into two for the 2.3 driver for these reasons, and to
reduce cache footprint. This eliminates the option of having
full_bus_master_rx and not full_bus_master_tx (and vice versa), but
that combination doesn't happen.
--
-akpm-