2.0 kernels, tulip driver, crashes and reboots (long)
Robert G. Brown
rgb@phy.duke.edu
Fri Jan 8 10:44:16 1999
On Thu, 7 Jan 1999, Al Youngwerth wrote:
> We make an embedded system that uses linux and a headless PC. We're trying
> to qualify a new hardware platform that uses a VIA VPX-based motherboard
> (from Epox), Intel Pentium 133, 16MB RAM, and a PNIC-based 10/100 tulip
> clone. Part of our qualification testing is to get a bunch of systems
> running in a room without any crashes or spontaneous reboots for over two
> weeks. We've been having some trouble.
...
A few remarks:
a) If the systems crash (whether they lock up or not) with different
network cards in them (you cited both NE2000s and PNIC tulips), then it
is unlikely that the network card is the source of the crash. The two
drivers share very little code, and a P133 isn't exactly a high-stress
environment.
b) The PNIC cards with the tulip driver may well be unstable -- I've
never tried them. However, NE2K cards and "true" 21140 tulip cards are
awesomely stable. I routinely achieve 100+ day uptimes with every
kernel after 2.0.33, and was well on my way to 100 days with 2.1.131
before I decided to convert to Red Hat on the system in question and
haven't spent the time to figure out how to build/install 2.1.x under RH
since. This is in a far more demanding environment -- SMP systems with
very heavy CPU and network loads. You can always swap in a true tulip
card or five and get statistics on them. But as I said, I doubt that
your problem is the network card.
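For collecting those statistics, something crude along the lines of the
sketch below (just a sketch -- the 60-second interval is arbitrary, and
you'd redirect its output to a file) can run on each test box and
snapshot /proc/net/dev, error and drop counters included, for the
duration of the soak test:

/* netlog.c -- periodically snapshot /proc/net/dev so that error and
 * drop counters can be compared across cards after a long soak test.
 * Build with: gcc -o netlog netlog.c */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char line[512];

    for (;;) {
        FILE *fp = fopen("/proc/net/dev", "r");
        time_t now = time(NULL);

        if (fp == NULL) {
            perror("/proc/net/dev");
            return 1;
        }
        printf("=== %s", ctime(&now));  /* ctime() supplies a newline */
        while (fgets(line, sizeof(line), fp) != NULL)
            fputs(line, stdout);
        fclose(fp);
        fflush(stdout);                 /* so redirected output lands */
        sleep(60);                      /* once a minute is plenty */
    }
}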
c) To me, your symptoms sound like a very low-level configuration
problem of one sort or another -- perhaps a BIOS or memory issue, or, as
you note, an APM issue. linux runs stably on far too many P5/P6 systems
for a serious problem in the kernel itself to be likely, but certain
hardware combinations or BIOS setups can certainly destabilize a
system. Is the system caching anything at the hardware/BIOS level?
That can be a problem. What do /proc/ioports, /proc/interrupts,
/proc/devices, and /proc/pci look like?
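Something quick and dirty like the sketch below (a sketch only, but the
filenames are the stock /proc entries on a 2.0 kernel) will dump those
four files so you can collect them from every box and compare them side
by side:

/* procdump.c -- dump the /proc entries mentioned above in one shot.
 * Build with: gcc -o procdump procdump.c */
#include <stdio.h>

static void dump(const char *path)
{
    char line[512];
    FILE *fp = fopen(path, "r");

    printf("----- %s -----\n", path);
    if (fp == NULL) {
        perror(path);
        return;
    }
    while (fgets(line, sizeof(line), fp) != NULL)
        fputs(line, stdout);
    fclose(fp);
}

int main(void)
{
    dump("/proc/ioports");
    dump("/proc/interrupts");
    dump("/proc/devices");
    dump("/proc/pci");
    return 0;
}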
d) Another possibility is that your memory itself is marginal. The
absolute worst cases of hardware debugging I have encountered have
centered on bad/marginal memory. We recently acquired a dual 450 MHz
PII system that crashed every time we put a significant load on it and
crashed anyway (after a longer time) even WITHOUT a load on it -- mean
uptime before a crash of perhaps a day or two (an hour or less under
load). We were pulling out our hair -- we tried swapping cards, CPUs,
and were close to trying another motherboard when we decided to swap
SDRAM DIMMs instead. Turned out our "certified" PC100 memory sucked --
we put in over-the-counter PC100 memory from a local vendor and have had
zero crashes under load or otherwise. Sounds like you got all of those
motherboards at once from somebody, and presumably got the same memory
on all of them. You might try getting some memory from a DIFFERENT
(EDO?) vendor, swapping it into a few of the systems, and seeing if they
crash. Some motherboards are far less tolerant than others of "bad" or
marginally spec'd memory.
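Before you start pulling DIMMs you can also thrash the memory from user
space. The fragment below is only a rough sketch -- the 12 MB buffer
size is a guess for a 16 MB box, and a user-space pattern test is no
substitute for physically swapping the memory -- but if it reports
read-back errors you can stop blaming the network driver:

/* memthrash.c -- repeatedly write and verify alternating bit patterns
 * (mixed with the array index) over a large buffer.  Run it until
 * interrupted; any reported errors point at marginal RAM or at
 * chipset/BIOS-level caching trouble.
 * Build with: gcc -o memthrash memthrash.c */
#include <stdio.h>
#include <stdlib.h>

#define MEGS 12  /* guess: leave a few MB of a 16MB box for the kernel */

int main(void)
{
    size_t n = (size_t)MEGS * 1024 * 1024 / sizeof(unsigned long);
    unsigned long *buf = malloc(n * sizeof(unsigned long));
    unsigned long patterns[] = { 0x55555555UL, 0xaaaaaaaaUL, 0UL, ~0UL };
    unsigned long pass, errors = 0;
    size_t i, p;

    if (buf == NULL) {
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    for (pass = 0; ; pass++) {          /* loop until interrupted */
        for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
            for (i = 0; i < n; i++)
                buf[i] = patterns[p] ^ (unsigned long)i;
            for (i = 0; i < n; i++)
                if (buf[i] != (patterns[p] ^ (unsigned long)i))
                    errors++;
        }
        printf("pass %lu: %lu read-back errors so far\n", pass, errors);
        fflush(stdout);
    }
}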
e) The final possibility (that I can think of) is to see whether your
problem is peculiar to the Epox MoBo you are using, if it is at all
possible to swap it for another "equivalent". Fundamental stability vs.
instability is usually easy to distinguish even with small samples.
Just because a MoBo has been used successfully with Windoze is no reason
to believe that it is reliable -- Windows typically has a mean uptime
measured in days under load anyway, so hardware problems are "invisible"
against the dominant software problems (memory leaks and so forth).
Even NT is none too stable for the purposes of validating hardware.
I've used linux on Pentia, AMD's, PPro's, PII's, Celerons in single and
dual configurations (probably several hundred systems total in a couple
dozen different hardware configurations) and have literally never
encountered a system (yet) that I could not totally stabilize, but
there is always a first time. It may be that your motherboard has
some feature that just won't work with linux unless/until you hack the
kernel itself.
Hope this helps, and hang in there. As I said, if you persevere (and
eliminate any possible hardware/bios problems by systematic swaps and
the process of elimination) you have an excellent chance of beating the
problem.
rgb
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb@phy.duke.edu