SMP+Tulip lockup, 2.0.36pre15, help requested

Patrick J. LoPresti patl@cag.lcs.mit.edu
Wed Nov 4 11:08:17 1998


A couple of weeks ago, we upgraded our master server from Linux 2.0.35
(tulip.c version 0.88) to 2.0.36pre15 (tulip.c version 0.89H).

Ever since 2.0.33 or so, we have occasionally seen the following
message logged:

  eth0: Re-entering the interrupt handler with proc 1, proc 1 already handling.

...but the system always seemed to continue working fine.

Since installing 2.0.36pre15 10 days ago, we had not seen that message
until last night.  Then the message started spewing to the console
repeatedly (very, very fast) and the system locked up completely.  It
required a hard reset to recover.

The system is a dual 300MHz Pentium II on an Asus P2L97-DS
motherboard.  That's the 440LX chipset with on-board AIC-7880 SCSI.
The ethernet card is a Netgear FA-310TX with a true DEC chipset (i.e.,
a slightly older Netgear).

Perhaps relevant, perhaps not: The SCSI controller and the Netgear
card are sharing IRQ 10.

I note with some interest the following diff between tulip.c versions
0.88 and 0.89H:

 -----

@@ -2119,8 +2217,8 @@
 			   tp->smp_proc_id, hard_smp_processor_id());
 #else
 		printk(KERN_ERR "%s: Re-entering the interrupt handler.\n", dev->name);
-		return;
 #endif
+		return;
 	}
 	dev->interrupt = 1;
 #ifdef SMP_CHECK

 -----

That is, in an SMP build the code path which generates the dreaded
message used to fall through to the rest of the interrupt handler, but
now returns immediately.  (The non-SMP path already returned.)  I do
not know kernel internals well enough to know whether or how this
could cause the problem.
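
To make the difference concrete, here is a stand-alone user-space
sketch of the two control flows as I read the diff above.  It is not
the driver code: struct fake_dev, service_device(), handler_old() and
handler_new() are hypothetical names of my own, and the real handler's
flag handling and register access are left out entirely.

```c
#include <stdio.h>

/* Hypothetical stand-in for the driver state; not the real struct. */
struct fake_dev {
    int interrupt;  /* nonzero while a handler invocation is in progress */
    int serviced;   /* how many times the handler body actually ran */
};

/* Stand-in for the work the real handler does after the guard. */
static void service_device(struct fake_dev *dev)
{
    dev->interrupt = 1;
    dev->serviced++;
    dev->interrupt = 0;
}

/* 0.88-style guard (SMP build): warn, then fall through. */
static int handler_old(struct fake_dev *dev)
{
    if (dev->interrupt)
        fprintf(stderr, "Re-entering the interrupt handler.\n");
    service_device(dev);
    return 1;   /* device was serviced */
}

/* 0.89H-style guard: warn, then return without touching the device. */
static int handler_new(struct fake_dev *dev)
{
    if (dev->interrupt) {
        fprintf(stderr, "Re-entering the interrupt handler.\n");
        return 0;   /* device was NOT serviced */
    }
    service_device(dev);
    return 1;
}
```

If the flag is already set on entry, handler_old() still services the
device and leaves the flag clear, while handler_new() leaves both the
device and the flag untouched.  Whether that is what happens in the
real driver on our hardware, I cannot say.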

This is a production system, so I would really appreciate a quick
solution...  Should we downgrade tulip.c?  Downgrade to 2.0.35?
Upgrade to 2.0.36pre16?

Thanks!

 - Pat

P.S.  I would be happy to provide additional information upon
request, of course.