2.0 kernels, tulip driver, crashes and reboots (long)

Thu Jan 7 20:12:38 1999

We make an embedded system that uses Linux and a headless PC. We're trying
to qualify a new hardware platform that uses a VIA VPX based motherboard
(from Epox), Intel Pentium 133, 16MB RAM, and a PNIC-based 10/100 tulip
clone. Part of our qualification testing is to get a bunch of systems
running in a room without any crashes or spontaneous reboots for over two
weeks. We've been having some trouble.

To test the systems, we load up a little test program that blinks LEDs on
the system so we have a visual indication of a system's basic health. We
also added a cron job that runs once a minute so we have a steady log to
track crashes/reboots, and a program on the system parses the logs to
detect crashes or reboots. All of this information is then uploaded to a
server so we can build a spreadsheet of crashes and reboots.
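
As a rough illustration of the kind of log check described above, here is a
minimal Python sketch. It assumes the once-a-minute job appends one
timestamp per line in "YYYY-MM-DD HH:MM:SS" form and flags any gap longer
than a few minutes. The log format, the name find_gaps, and the 3-minute
threshold are all hypothetical, chosen for illustration; this is not a
description of the actual parser.

```python
from datetime import datetime, timedelta

def find_gaps(lines, max_gap=timedelta(minutes=3)):
    """Return (last_seen, next_seen) pairs where consecutive heartbeat
    entries are further apart than max_gap."""
    # One timestamp per non-empty line (assumed format).
    stamps = [datetime.fromisoformat(s.strip()) for s in lines if s.strip()]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        if cur - prev > max_gap:
            gaps.append((prev, cur))
    return gaps

# Made-up sample log, for illustration only.
sample = [
    "1999-01-07 20:10:00",
    "1999-01-07 20:11:00",
    "1999-01-07 20:12:00",
    "1999-01-07 20:31:00",
    "1999-01-07 20:32:00",
]
gaps = find_gaps(sample)
for last_seen, next_seen in gaps:
    print("gap from", last_seen, "to", next_seen)
```

A gap by itself does not say whether the system locked up or rebooted;
telling the two apart would need something extra, such as a marker written
to the log at boot time.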

Here's what we've found: out of 50+ systems running over the past 8 weeks,
every one of them has crashed (locked up) or spontaneously rebooted. We have
observed the results of some crashes with a video card and keyboard plugged
into the system: no kernel panic, just a blank screen. The reboots are not
graceful shutdowns; the disk partitions are always dirty.

We've run many different tests to try to eliminate certain factors and
focus in on others. Whenever we run a different test, we put it on 10
systems and observe what happens. All systems are essentially idle and not
connected to a network (except on a one by one basis to telnet in to check
the logs). Thermals in the systems are good (ambient room temperature ~32C,
in-case ambient ~36C, top of CPU ~42C). Here are some of the data points
we've taken.

1) Stock 2.0.35 kernel locks and reboots. A stock 2.0.36 kernel only
reboots. We have loaded 10 systems with a stock 2.0.31 kernel: no reboots
in 2 days (still inconclusive). The average time between reboots for a
2.0.36 kernel varies between systems, from once every 4 days to once every
23 days (the overall average across all units is once every 8.5 days).
Unfortunately, we don't have accurate data on the frequency of the 2.0.35
kernel reboots because we didn't realize they were rebooting until we
started parsing the logs (after we switched our focus to the 2.0.36
kernel). The stock 2.0.35 kernel was crashing on average, across all units,
about once every 12 days.

2) A 2.0.36 kernel with the .90f version of the tulip driver both locks and
reboots. Tulip .89K and .87 only reboot, no lockups (.87 is stock for .35
and .36).

3) 10 systems with ne2000 cards and the 2.0.36 kernel in them reboot but
don't lock up.

4) We don't believe it is related to other software running on the systems.
We took 10 systems down to the point that they were running only inetd (so
we could telnet into them), klogd, and syslogd. We still got reboots.

5) We don't believe it is related to hardware (although not entirely
convinced it isn't). The systems include a mix of different power supplies,
different motherboards (although all VIA/Award BIOS), different CPUs
(P100s, P133s, and P200MMX), different RAM manufacturers, etc. and they all
fail. The reboots could be caused by brown-outs (power good signal going
low will cause a hard reset) but you would expect a bunch to fail at the
same time (they don't). We do have one system set up with a digital o-scope
to trigger on power good dropping below 4.6 volts but no trigger in the
past 9 days (and no reboots on that system).

6) The lock-up problem in 2.0.35 may be related to APM. When we were
testing the 2.0.35 kernel, we reduced the frequency of lockups from a per
unit rate of once every 12 days to a per unit rate of about once every 70
days by disabling APM in the BIOS. We tested 10 units with APM disabled in
the BIOS with 2.0.36 and they still didn't crash, but their reboot rate
remained the same. I've looked at the diffs in the APM code between 2.0.35
and 2.0.36 and what we are compiling up and there shouldn't be a bit of
difference. We also compiled up a 2.0.36 kernel with a hacked APM module
that logged each APM event, sync'd the disk and then let the event go
through the normal APM code. Put this on 10 systems and they still rebooted
and we never saw any of the APM events in the logs.
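
The "still inconclusive" remark in item 1 can be checked with a
back-of-the-envelope calculation. The sketch below assumes reboots arrive
independently at a constant rate (a Poisson model, which is an assumption
and not something the data above establishes) and uses the 8.5-day overall
average from item 1. Under that model it estimates how likely it is that 10
systems would show no reboots in 2 days even if 2.0.31 were no better than
2.0.36. The function names are hypothetical.

```python
import math

def prob_no_reboots(units, days, mean_days_between_reboots):
    """Expected reboot count and probability of seeing none,
    assuming a constant, independent per-unit reboot rate."""
    expected = units * days / mean_days_between_reboots
    return expected, math.exp(-expected)

def days_needed(units, mean_days_between_reboots, p_target=0.01):
    """Days of clean running needed before 'no reboots' would have
    probability p_target or less under the same assumed rate."""
    return -math.log(p_target) * mean_days_between_reboots / units

expected, p_zero = prob_no_reboots(units=10, days=2,
                                   mean_days_between_reboots=8.5)
print(f"expected reboots: {expected:.2f}, P(no reboots): {p_zero:.3f}")
print(f"days needed for 1% level: {days_needed(10, 8.5):.1f}")
```

With these numbers the model gives roughly a one-in-ten chance of a clean
2-day run on 10 units by luck alone, which is consistent with treating the
2.0.31 result as inconclusive for now.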

Preliminary conclusions:

1) There is a problem with the .90f tulip driver and the PNIC-based 10/100
cards. There may be a problem with other tulip chipsets, but I do not have
any other cards to verify this data. This may also be specific to VIA
chipsets.

2) The 2.0.35 kernel has a lockup problem running on the VIA motherboard
(and perhaps other motherboards). The diff between 2.0.35 and 2.0.36 is
huge; trying to trace the key patch down could take years with this kind of
trial and error testing.

3) Both 2.0.35 and 2.0.36 seem to have a spontaneous reboot problem; we'll
have better data on the 2.0.31 kernel systems in a couple more days. Again,
this could be related to hardware, most likely motherboard/BIOS. We will
have 10 more systems set up tomorrow with Intel TX/Award BIOS. We are also
going to load 10 systems with DOS to see if they reboot or lock up. The
reboot problem also could be environmental: we have 50+ systems in a small
room on one power circuit. Tonight we are moving 10 of those systems and
spreading them around the building.
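
One way to test the shared-power-circuit idea against logs that are already
being collected is to look for reboots on different systems that happen
close together in time. The following is a minimal Python sketch; the input
format (pairs of system name and reboot time), the name clustered_reboots,
the 2-minute window, and the sample data are all hypothetical.

```python
from datetime import datetime, timedelta

def clustered_reboots(events, window=timedelta(minutes=2)):
    """Group reboot events that follow one another within `window`;
    return only groups involving more than one distinct system."""
    ordered = sorted(events, key=lambda e: e[1])
    clusters = []
    current = []
    for name, when in ordered:
        # Close the current group once the next event is too far away.
        if current and when - current[-1][1] > window:
            if len({n for n, _ in current}) > 1:
                clusters.append(current)
            current = []
        current.append((name, when))
    if len({n for n, _ in current}) > 1:
        clusters.append(current)
    return clusters

# Made-up reboot times, for illustration only.
events = [
    ("unit03", datetime(1999, 1, 2, 3, 15, 10)),
    ("unit17", datetime(1999, 1, 2, 3, 15, 40)),
    ("unit08", datetime(1999, 1, 4, 11, 2, 0)),
    ("unit03", datetime(1999, 1, 6, 22, 45, 0)),
]
clusters = clustered_reboots(events)
for group in clusters:
    print("possible common cause:", [name for name, _ in group])
```

If real logs never produce such groups, that would argue against a common
power event, in line with the observation in item 5 that the systems do not
fail at the same time.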

Any and all comments/ideas appreciated.

Thanks,

Al Youngwerth
alberty@apexxtech.com