Update: 2.0 kernels, tulip driver, crashes and reboots (long)

Al Youngwerth alberty@apexxtech.com
Mon Jan 25 16:29:51 1999


Sorry about the long delay on the update. I had to run off to the U.K. last
week and got my laptop stolen at Heathrow, making communications difficult
for me.

Just as a reminder, we were seeing random, occasional lockups in our systems
based on VIA VPX motherboards. In testing, the lockups appeared to be
related to kernel versions after 2.0.31 and perhaps various tulip drivers.

After finally getting the right equipment for the job (a Tektronix TLA 704
logic analyzer, my nominee for best Win95 application ever), we were able
to catch an instruction trace of the failure mode.

Turns out that when the system locks up, the processor is in SMM (System
Management Mode) with no instruction fetches and what appear to be fairly
random bus cycles. It didn't seem to make much sense that we were in SMM
because we had APM disabled in the BIOS (more on this later).

With this data we were able to set up an end trace on the failure condition
to see what led up to the crash. The next crash trace showed an SMI (System
Management Interrupt) being serviced correctly, resumed, then a whole bunch
of OS/Application code executing, then another SMI and the crash. In this
last SMI, when the processor goes to fetch the SMI handler instructions, it
pulls garbage out of RAM and consequently the processor goes into the weeds
(locks or reboots).

From working with the motherboard vendor and reading the schematics and the
chipset data sheet, we concluded that this should not happen. The SMI code
lives in system RAM in the same place video RAM is normally decoded. At
startup, the BIOS programs the chipset to map system RAM into the space
normally occupied by video RAM to hold the SMM code. After the BIOS copies
the SMM code into system RAM, it sets a register in the chipset to protect
this RAM so it can only be read from or written to while the processor is
in SMM. When an SMI is generated, the chipset then maps in the system RAM
(over the video RAM) to execute the SMM code. So, if the SMM code is getting
whacked, it has to be the fault of the SMM code itself (bad self-modifying
code) or the chipset or motherboard design (improper decoding of the memory
space).
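
For anyone who wants the decode behavior spelled out, here is a rough toy
model of the scheme described above. It is only a sketch: the class, its
flags, and the 0xA0000-0xBFFFF window (the usual legacy video RAM range) are
assumptions for illustration, not the chipset's real register interface.

```python
# Toy model of the SMRAM decode described above. Not the real chipset
# register interface: the class, flags, and window bounds are assumptions.

VGA_WINDOW = range(0xA0000, 0xC0000)  # legacy video RAM window (assumed)


class ToyChipset:
    def __init__(self):
        self.system_ram = {}     # RAM hidden behind the window (holds SMM code)
        self.video_ram = {}      # what the window normally decodes to
        self.smram_open = False  # hypothetical: BIOS has system RAM mapped in
        self.smiact = False      # True while the CPU is servicing an SMI

    def _backing(self, addr):
        if addr not in VGA_WINDOW:
            raise ValueError("address outside the modeled window")
        # System RAM is visible only in SMM, or while the BIOS has it open.
        if self.smiact or self.smram_open:
            return self.system_ram
        return self.video_ram

    def write(self, addr, value):
        self._backing(addr)[addr] = value

    def read(self, addr):
        return self._backing(addr).get(addr, 0xFF)


def boot_and_scribble():
    chip = ToyChipset()
    handler_addr = 0xA8000          # arbitrary address inside the window

    chip.smram_open = True          # BIOS maps system RAM over the window,
    chip.write(handler_addr, 0x0F)  # copies in a handler byte,
    chip.smram_open = False         # then protects the region

    chip.write(handler_addr, 0x00)  # write outside SMM lands in video RAM

    chip.smiact = True              # SMI: system RAM is mapped back in
    fetched_in_smm = chip.read(handler_addr)
    chip.smiact = False
    seen_outside_smm = chip.read(handler_addr)
    return fetched_in_smm, seen_outside_smm
```

With correct decoding, the write made outside SMM never reaches the handler
byte, so the fetch during the SMI still sees what the BIOS copied in. Garbage
at fetch time means either the handler changed its own code or the decode
did not behave like this model.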

So then we get back to, "why are we generating SMIs in the first place if
APM is disabled?" Turns out the USB on the motherboard uses SMM to poll for
dumb devices like keyboards. (With the Award BIOS, if you plug in a regular
keyboard, it seems to shut off the USB polling.)

We still don't know exactly what causes the problem or why different
kernels seem to affect the problem. What we do know is that with APM and
USB disabled we have no problems.

We are still working with Epox (the motherboard vendor) to find the real
problem. When we find it, I'll post to the list. In the meantime, if you're
seeing lockups with a VIA VPX-based motherboard, try turning off APM and
USB support in the BIOS setup and your problems should go away. If you
suspect you are seeing the problem, you can confirm it by very carefully
probing the SMIACT# signal (pin 58) on the VIA VT82C585VPX chipset with a
DMM. If it's stuck low, you've got the problem.
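
SMIACT# is active low, so on a healthy board it should sit high nearly all
the time and drop only briefly while an SMI is being serviced. If you jot
down a handful of DMM readings, a check along these lines captures the
"stuck low" test. The function name is made up, and the 0.8 V limit is an
assumed TTL-style logic-low threshold, not a figure from the data sheet.

```python
def smiact_stuck_low(readings_v, low_max_v=0.8):
    """Return True if every DMM sample (in volts) is at or below low_max_v.

    low_max_v is an assumed logic-low threshold; adjust it for your board.
    """
    if not readings_v:
        raise ValueError("need at least one reading")
    return all(v <= low_max_v for v in readings_v)
```

For example, smiact_stuck_low([0.10, 0.05, 0.12]) reports the signal as stuck
low, while a list that includes readings near the supply voltage does not.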

Thanks to everyone for their help and suggestions on this one.

Cheers,

Al Youngwerth
alberty@apexxtech.com