[vortex] Strange problem with 3c905A

Ville Herva vherva@mail.niksula.cs.hut.fi
Mon, 6 Aug 2001 08:55:25 +0300


On Sun, Aug 05, 2001 at 05:48:30PM -0700, you [Andrew Morton] claimed:
> Ville Herva wrote:
> > 
> > [I'll read the archive, but please Cc me]
> > 
> > kernel 2.2.8pre19 + its stock 3c59x.c, SMP, 3c905A.
> 
> 2.2.8?  Or 2.2.18??  Please confirm...

2.2.18pre19 of course, I'm sorry.
 
> Your APIC (the thing which controls interrupts on SMP) has
> lost its brains, and no interrupts are being delivered.

Hmm. Makes sense. I have an Abit BP6 mobo, and while it currently seems
stable (with the newest BIOS), I had great trouble getting it to behave when
I bought it. And I wasn't the only one -- it seemed just about everyone had
some problem getting it to do SMP properly.

I recall having problems more than a year ago that might also be explained
by interrupts not being delivered. With an earlier BIOS (and when I
experimented with overclocking the thing -- not so successfully) the box
would run seemingly stably for 30 days or so and then, all of a sudden,
would seem to refuse to do any I/O. The mouse was all jerky, and the box
reacted somewhat to the keyboard, but everything was really sticky. And if
the PC speaker started beeping, it would never stop. Also, the HD lights were
on all the time. Reboot never succeeded cleanly - it seemed like HD I/O never
completed.
 
> Donald's drivers have a sneaky pseudo-polling fallback mode
> which allows them to continue to limp along when interrupts
> aren't being delivered, which is why the interface still
> responds to some pings.  The idea here is that there's enough
> throughput to allow you to telnet in and reboot.

Nice! Unfortunately, at a 50%-90% drop rate I wasn't able to ssh in...
 
> Nobody *really* seems to have a 100% explanation for this, but
> it's certainly the case that it's due to a race between the
> linux disable_irq() function and the delivery of a hardware
> interrupt.  Drivers which don't use disable_irq() will never
> experience this.  It only happens on SMP.
> 
> There's a fix in 2.4.x-ac kernels which works just fine.
> It's *still* not in Linus' 2.4.x kernels (grump).

I haven't been that keen on jumping to 2.4 on this box as it took a long
while to find a kernel that is stable with this hw.

If this was easy to reproduce, I'd give it a shot.
 
> I have seen very few if any reports of it occurring on 2.2.x kernels.
> 
> One fix is of course to stop using disable_irq().  I'd rather
> not have to do that - it's quite convenient to use this function
> in the media timer handler.
> 
> You can make it stop happening by booting with the `noapic' LILO
> option.

I'll try these if I can reliably reproduce it; so far this was the first
incident with this mobo/nic combo, and I've been running it for more than
two years.
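
For the record, a minimal sketch of how the option could be passed. The image
path and label below are placeholders, not taken from my actual setup.
Either type it once at the LILO prompt:

```
LILO boot: linux noapic
```

or make it permanent in /etc/lilo.conf and re-run /sbin/lilo afterwards:

```
# placeholder image path and label; adjust to the installed kernel
image=/boot/vmlinuz
    label=linux
    append="noapic"
```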

Thanks!


-- v --

v@iki.fi