FIX: 0.99L and timeouts

Bogdan Costescu Bogdan.Costescu@IWR.Uni-Heidelberg.De
Thu Apr 20 10:56:31 2000


On Thu, 20 Apr 2000, Andrew Morton wrote:

> It's not an interrupt race.  spin_lock_irqsave() disables interrupts on
> the local CPU and also grabs the spinlock, so the local CPU can't take
> an interrupt and any other CPU will spin on the lock on entry to the ISR
> until the local CPU releases the lock.
....
> TxIntrUploaded bit.  This interrupt will be pending, but the local CPU
> won't actually take it until it hits the spin_unlock_irqrestore() which
> reenables local interrupts.  (Another CPU may take it earlier and spin
> on the spinlock in the ISR though).


IMHO, the problem is that if another CPU takes the interrupt, it computes
entry and prev_entry from vp->cur_tx _before_ taking the spinlock.
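
To make the ordering I have in mind concrete, here is a small
single-threaded model of it. This is only a sketch and not driver code:
struct fake_priv, TX_RING_SIZE, lock_holder_queues() and the other helpers
are hypothetical names invented for illustration, and the "spin" on the
lock is modelled simply by letting the lock holder advance cur_tx before
the waiter continues.

```c
#include <assert.h>

#define TX_RING_SIZE 16   /* hypothetical ring size for this sketch */

/* Hypothetical stand-in for the driver's private data. */
struct fake_priv {
    unsigned int cur_tx;  /* next free slot, advanced by the lock holder */
};

/* Models what the lock holder does while the other CPU waits:
 * it queues n more packets and so advances cur_tx. */
static void lock_holder_queues(struct fake_priv *vp, unsigned int n)
{
    vp->cur_tx += n;
}

/* Suspect ordering: the ring index is sampled before the lock is taken. */
static unsigned int entry_read_before_lock(struct fake_priv *vp,
                                           unsigned int queued_meanwhile)
{
    unsigned int entry = vp->cur_tx % TX_RING_SIZE;  /* sampled too early */
    lock_holder_queues(vp, queued_meanwhile);        /* we "spin" here */
    /* Lock acquired: entry still describes the old cur_tx. */
    return entry;
}

/* Safer ordering: the ring index is sampled once the lock is held. */
static unsigned int entry_read_after_lock(struct fake_priv *vp,
                                          unsigned int queued_meanwhile)
{
    lock_holder_queues(vp, queued_meanwhile);        /* we "spin" here */
    unsigned int entry = vp->cur_tx % TX_RING_SIZE;  /* sampled under lock */
    return entry;
}

/* Runs both orderings from the same starting state and returns how many
 * slots the early-sampled index lags behind the one sampled under the lock. */
static unsigned int stale_distance(unsigned int start, unsigned int queued)
{
    struct fake_priv a = { start }, b = { start };
    unsigned int early = entry_read_before_lock(&a, queued);
    unsigned int late = entry_read_after_lock(&b, queued);
    assert(a.cur_tx == b.cur_tx);  /* same final ring state either way */
    return (late - early) % TX_RING_SIZE;
}
```

In this model, stale_distance(5, 2) reports that the early-sampled index
is two slots behind. Whether the real driver can hit this window depends
on what the code does with entry and prev_entry after it gets the lock, so
take this as an illustration of the suspicion rather than a proof.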

> netperf is useful: www.netperf.org.  It simply measures one-way TCP
> traffic.

I used ttcp which has similar capabilities. I haven't observed any major
difference in results obtained with ttcp between 0.99L, 3Com's and your
driver. However, using 3Com's driver produces worse results than 0.99L for
our parallel codes, while 0.99L and your driver are very close.
One reason for using a parallel job for testing is that the CPU is loaded
along with the network, which might help in triggering SMP races.

Ooooo.. bad news! It seems that with your driver (and DownUnstall moved) I
occasionally get frozen systems. I delayed this message until I was sure
that I could reproduce it, and I have since had another computer freeze
(which rules out a sudden hardware problem). This happens only under the
load of a parallel job, and only from time to time: I was able to run our
short parallel test (about 10 minutes) several times, but afterwards it
froze. I should add that I left a flood ping running overnight on 2 pairs
of computers, and all 4 were happily chewing packets this morning...
I know that RedHat identified a problem with 2.2.14 kernels under CPU
load; I will try to patch my kernel and see if the problem is still there.

> Which card, precisely? (0x10B7, 0x9200)?

Yes.

> I would have thought that Tornado has NWAY.  This probably explains a
> few things.

It's documented on page 20. But before having the docs, I just thought:
why would 3Com remove something useful? Also, if I'm using 3Com's driver,
the card is initialized in FD mode and vortex-diag reports
Autonegotiate, so this is clearly possible...

> mm..  Auto-neg is pretty notorious for not working right.

Then maybe I'm just lucky! I have my computers connected using 3Com,
tulip-based and SMC (epic100)-based cards to several switches and I have
never had any problems with autonegotiation... In this particular case, I
would expect highly engineered (and expensive!) products such as 3Com cards
and BayNetworks switches to implement the standards properly.

Sorry about the collisions, I don't have any idea...

Sincerely,

Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu@IWR.Uni-Heidelberg.De
