2.0 kernels, tulip driver, crashes and reboots (long)

Geoff Thompson geofft@waikato.ac.nz
Thu Jan 7 23:04:23 1999


Al,

I'm possibly not the most qualified person to answer this, but a couple of
ideas spring to mind:

1. What drivers/modules are loaded or built into the kernel?  Disabling as
many as possible (including APM support, if it's enabled) may help.

2. I am not familiar with these motherboards, but running your tests on a
system with a different motherboard chipset may at least narrow down what is
at fault (since you seem to have problems with an ne2000 ethernet card
as well).  Have you tried an Intel-based motherboard?

3. What drives are involved: IDE, SCSI, or none?  Disabling the
IDE/SCSI interfaces in the BIOS and booting a kernel off a floppy, with no
swap, may be another good test of your systems.
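
For point 1, the loaded modules can be read from /proc/modules. A rough
sketch in Python follows (my choice of language for illustration, nothing
from Al's setup). The helper name loaded_modules is made up, and only the
first field of each line is used because the other columns vary between
kernel versions. Drivers compiled into the kernel will not show up here; for
those you would have to check the kernel configuration or the boot messages.

```python
def loaded_modules(proc_modules_text):
    """Return module names found in the text of /proc/modules.

    Only the first whitespace-separated field of each line is used,
    since the remaining columns differ between kernel versions.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            names.append(fields[0])
    return names


if __name__ == "__main__":
    try:
        with open("/proc/modules") as f:
            text = f.read()
    except OSError:
        # No /proc/modules: not Linux, or a kernel without module support.
        text = ""
    for name in loaded_modules(text):
        print(name)
```

Comparing that list between a stripped-down test kernel and the production
kernel is one way to confirm that a module really was left out.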

Just a couple of thoughts for you.

Geoff.


On 07-Jan-99 Al Youngwerth wrote:
> We make an embedded system that uses linux and a headless PC. We're trying
> to qualify a new hardware platform that uses a VIA VPX based motherboard
> (from Epox), Intel Pentium 133, 16MB RAM, and a PNIC-based 10/100 tulip
> clone. Part of our qualification testing is to get a bunch of systems
> running in a room without any crashes or spontaneous reboots for over two
> weeks. We've been having some trouble.
> 
> To test the systems, we load up a little test program that blinks LEDs on
> the system so we have a visual indication of a system's basic health. We
> also added a cron job once a minute so we have a steady log to track
> crashes/reboots and then we have a program on the system that can parse the
> logs to detect crashes or reboots. All of this information is then uploaded
> to a server so we can build a spreadsheet of crashes and reboots.
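
The once-a-minute cron entry acts as a heartbeat, so any stretch with no
entries marks a crash or a reboot. A minimal sketch of that gap check in
Python, assuming each log line starts with a timestamp in the form
"1999-01-07 23:04:00" (a hypothetical format; the actual log format and
parser are not shown in this message). The names parse_heartbeats and
find_gaps and the three-minute threshold are illustrative only.

```python
from datetime import datetime, timedelta

# Hypothetical timestamp layout; change to match what the cron job writes.
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
STAMP_LENGTH = 19  # length of a timestamp such as "1999-01-07 23:04:00"


def parse_heartbeats(lines):
    """Parse the leading timestamp of each line, skipping lines that do not match."""
    stamps = []
    for line in lines:
        try:
            stamps.append(datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT))
        except ValueError:
            continue  # not a heartbeat line
    return stamps


def find_gaps(stamps, max_gap=timedelta(minutes=3)):
    """Return (last_seen, next_seen) pairs where consecutive heartbeats
    are further apart than max_gap. Assumes stamps are in log order."""
    gaps = []
    for earlier, later in zip(stamps, stamps[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


if __name__ == "__main__":
    log = [
        "1999-01-07 23:04:00 heartbeat",
        "1999-01-07 23:05:00 heartbeat",
        "1999-01-07 23:19:00 heartbeat",
    ]
    for earlier, later in find_gaps(parse_heartbeats(log)):
        print("no heartbeat between", earlier, "and", later)
```

A gap by itself does not say whether the box locked up or rebooted; telling
the two apart needs something extra, such as the boot messages in syslog.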
> 
> Here's what we've found: out of 50+ systems running over the past 8 weeks,
> every one of them has crashed (locked up) or spontaneously rebooted. We have
> observed the results of some crashes with a video card and keyboard plugged
> into the system: no kernel panic, just a blank screen. The reboots are not
> graceful shutdowns; the disk partitions are always dirty.
> 
> We've run many different tests to try to eliminate certain factors and
> focus in on others. Whenever we run a different test, we put it on 10
> systems and observe what happens. All systems are essentially idle and not
> connected to a network (except on a one by one basis to telnet in to check
> the logs). Thermals in the systems are good (ambient room temperature ~32C,
> in-case ambient ~36C, top of CPU ~42C). Here are some of the data points
> we've taken. 
> 
> 1) Stock 2.0.35 kernel locks and reboots. A stock 2.0.36 kernel only
> reboots. We have loaded 10 systems with a stock 2.0.31 kernel, no reboots
> in 2 days (still inconclusive). Average time to reboot for a 2.0.36 kernel
> varies between the systems ranging from an average once every 4 days to
> once every 23 days (the overall average across all units is once every 8.5
> days). Unfortunately, we don't have accurate data on the frequency of the
> 2.0.35 kernel reboots because we didn't realize they were rebooting until
> we started parsing the logs (after we switched our focus to the 2.0.36
> kernel). The stock 2.0.35 kernel was crashing on average, across all units,
> about once every 12 days.
> 
> 2) A 2.0.36 kernel with the .90f version of the tulip driver both locks and
> reboots. Tulip .89K and .87 only reboot, no lockups (.87 is stock for .35
> and .36).
> 
> 3) 10 systems with ne2000 cards and the 2.0.36 kernel reboot but
> don't lock up.
> 
> 4) We don't believe it is related to other software running on the systems.
> We took 10 systems down to the point that they were running only inetd (so
> we could telnet into them), klogd, and syslogd. We still got reboots.
> 
> 5) We don't believe it is related to hardware (although we're not entirely
> convinced it isn't). The systems include a mix of different power supplies,
> different motherboards (although all VIA/Award BIOS), different CPUs
> (P100s, P133s, and P200MMX), different RAM manufacturers, etc. and they all
> fail. The reboots could be caused by brown-outs (power good signal going
> low will cause a hard reset) but you would expect a bunch to fail at the
> same time (they don't). We do have one system set up with a digital o-scope
> to trigger on power good dropping below 4.6 volts but no trigger in the
> past 9 days (and no reboots on that system).
> 
> 6) The lock-up problem in 2.0.35 may be related to APM. When we were
> testing the 2.0.35 kernel, we reduced the frequency of lockups from a per
> unit rate of once every 12 days to a per unit rate of about once every 70
> days by disabling APM in the BIOS. We tested 10 units with APM disabled in
> the BIOS with 2.0.36 and they still didn't crash, but their reboot rate
> remained the same. I've looked at the diffs in the APM code between 2.0.35
> and 2.0.36 and what we are compiling up and there shouldn't be a bit of
> difference. We also compiled up a 2.0.36 kernel with a hacked APM module
> that logged each APM event, sync'd the disk and then let the event go
> through the normal APM code. Put this on 10 systems and they still rebooted
> and we never saw any of the APM events in the logs.
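
A rough way to put a number on "still inconclusive" in data point 1: treat
reboots as independent events at a constant rate (a Poisson assumption of
mine, not something measured here), and suppose the 2.0.31 systems reboot at
the same per-unit rate of once every 8.5 days seen with 2.0.36. The chance of
10 systems showing no reboot in 2 days can then be estimated. The function
names below are illustrative.

```python
import math


def expected_events(units, days, mean_days_between_events):
    """Expected number of events across all units, at a constant per-unit rate."""
    return units * days / mean_days_between_events


def prob_no_events(units, days, mean_days_between_events):
    """Poisson probability of seeing zero events in the whole test run."""
    return math.exp(-expected_events(units, days, mean_days_between_events))


if __name__ == "__main__":
    # Figures from data point 1: 10 systems, 2 days, one reboot per 8.5 days.
    print(round(expected_events(10, 2, 8.5), 2))
    print(round(prob_no_events(10, 2, 8.5), 3))
```

With those figures the sketch gives about 2.35 expected reboots and a
probability of about 0.095 of seeing none, so a clean two-day run would
happen roughly one time in ten even if 2.0.31 were no better. Under the same
assumptions, about four clean days on 10 systems pushes that below 1%.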
> 
> Preliminary conclusions:
> 
> 1) There is a problem with the .90f tulip driver and the PNIC-based 10/100
> cards. There may be a problem with other tulip chipsets, but I do not have
> any other cards to verify this data. This may also be specific to VIA
> chipsets.
> 
> 2) The 2.0.35 kernel has a lockup problem running on the VIA motherboard
> (and perhaps other motherboards). The diff between 2.0.35 and 2.0.36 is
> huge; tracing the key patch down could take years with this kind of
> trial and error testing.
> 
> 3) Both 2.0.35 and 2.0.36 seem to have a spontaneous reboot problem; we'll
> have better data on the 2.0.31 kernel systems in a couple more days. Again,
> this could be related to hardware, most likely motherboard/BIOS. We will
> have 10 more systems set up tomorrow with Intel TX/Award BIOS. We are also
> going to load 10 systems with DOS to see if they reboot or lock up. The
> reboot problem could also be environmental: we have 50+ systems in a small
> room on one power circuit. Tonight we are moving 10 of those systems and
> spreading them around the building.
> 
> Any and all comments/ideas appreciated.
> 
> Thanks,
> 
> Al Youngwerth
> alberty@apexxtech.com

----------------------------------
Geoff Thompson <geofft@waikato.ac.nz>
University of Waikato,
Hamilton, New Zealand
Ph: (07) 838 4748

Random Quote of the Minute :
These days the necessities of life cost you about three times what they
used to, and half the time they aren't even fit to drink.

----------------------------------