FIX: 0.99L and timeouts

Andrew Morton andrewm@uow.edu.au
Thu Apr 20 11:31:28 2000


Bogdan Costescu wrote:
> 
> On Thu, 20 Apr 2000, Andrew Morton wrote:
> 
> > It's not an interrupt race.  spin_lock_irqsave() disables interrupts on
> > the local CPU and also grabs the spinlock, so the local CPU can't take
> > an interrupt and any other CPU will spin on the lock on entry to the ISR
> > until the local CPU releases the lock.
> ....
> > TxIntrUploaded bit.  This interrupt will be pending, but the local CPU
> > won't actually take it until it hits the spin_unlock_irqrestore() which
> > reenables local interrupts.  (Another CPU may take it earlier and spin
> > on the spinlock in the ISR though).
> 
> IMHO, the problem is that if other CPU takes the interrupt, it computes
> entry and prev_entry based on vp->cur_tx which are _before_ the spinlock.

Why is this a problem?  The ISR doesn't change the value of cur_tx. 
cur_tx is only altered within the spinlock, where the current CPU has
complete control.


> > netperf is useful: www.netperf.org.  It simply measures one-way TCP
> > traffic.
> 
> I used ttcp which has similar capabilities. I haven't observed any major
> difference in results obtained with ttcp between 0.99L, 3Com's and your
> driver. However, using 3Com's driver produces worse results than 0.99L for
> our parallel codes, while 0.99L and your driver are very close.
> One reason for using a parallel job for testing is that the CPU is loaded
> along with the network, which might help in triggering SMP races.
> 
> Ooooo.. bad news! It seems that with your driver (and DownUnstall moved) I
> can get from time to time frozen systems. I delayed this message to be
> sure that I'm able to reproduce it and I got another computer frozen
> (which denies the possibility of a sudden hardware problem). This happens
> only under load of a parallel job and happens only from time to time: I
> was able to run several times our short parallel test (about 10 minutes),
> but afterwards it froze. I should add that I left over night a flooding
> ping working on 2 pairs of computers and I got all 4 happily chewing
> packets this morning...

Oh dear.  Perhaps set 'debug=1'?

I also suggest you put a big printk() in vortex_rx() - it should never be
called, and we're _technically_ still in violation of the specs:

Page 122, bit [4]:

"This bit is automatically acknowledged by the upload
engine as it uploads packets. Drivers should disable this
interrupt and mask this bit when reading IntStatus."

We don't mask it - we still test it, although we are disabling it in the
interrupt enable reg.
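
As a rough user-space sketch of what "masking the bit" would mean (not
driver code): the bit values below are assumed to match the driver's
IntStatus enum, and mask_status() is a made-up helper name used only for
illustration.

```c
#include <stdint.h>

/* Assumed IntStatus bit values; RxComplete is bit [4]. */
enum {
    TxAvailable = 0x0008,
    RxComplete  = 0x0010
};

/*
 * Hypothetical helper: clear the bits the ISR must not act on from a
 * raw IntStatus value, so later tests such as (status & RxComplete)
 * can never be true.
 */
uint16_t mask_status(uint16_t raw_status, uint16_t ignored_bits)
{
    return (uint16_t)(raw_status & ~ignored_bits);
}
```

In the ISR this would amount to something like
status = mask_status(inw(...), RxComplete) right after reading IntStatus.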

In fact, you could just remove the lines:

        if (status & RxComplete)
            vortex_rx(dev);
 
from vortex_interrupt.

And finally, in vortex_interrupt:

        if (status & TxAvailable) {
            if (vortex_debug > 5)
                printk(KERN_DEBUG " TX room bit was handled.\n");
            /* There's room in the FIFO for a full-sized packet. */
            outw(AckIntr | TxAvailable, ioaddr + EL3_CMD);
            clear_bit(0, (void*)&dev->tbusy);
            mark_bh(NET_BH);
        }

Put a big printk in here.  The test should never be true.  I actually
split the ISR into two for the 2.3 driver for these reasons, and to
reduce cache footprint.  This eliminates the option of having
full_bus_master_rx without full_bus_master_tx (and vice versa), but that
combination doesn't happen anyway.
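
A "big printk" check for both of these should-never-happen bits could
look roughly like the user-space stand-in below.  fprintf() takes the
place of printk(), the bit values are assumed to match the driver's
IntStatus enum, and report_unexpected() is a hypothetical name.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed IntStatus bit values. */
enum {
    TxAvailable = 0x0008,
    RxComplete  = 0x0010
};

/*
 * Hypothetical check: complain loudly if either bit that a full
 * bus-master configuration should never see is set in status.
 * Returns the offending bits (0 if none) so the caller can count them.
 */
uint16_t report_unexpected(uint16_t status)
{
    uint16_t bad = (uint16_t)(status & (TxAvailable | RxComplete));

    if (bad)
        fprintf(stderr,
                "*** unexpected IntStatus bits 0x%04x (status 0x%04x) ***\n",
                (unsigned)bad, (unsigned)status);
    return bad;
}
```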

-- 
-akpm-