Abit BP6 stability was: Stability of network cards at 75MHz+

Robert G. Brown rgb at phy.duke.edu
Tue May 16 08:04:45 PDT 2000

On Tue, 16 May 2000, Anand Kumria wrote:

> On Mon, May 15, 2000 at 11:16:33PM -0400, Robert G. Brown wrote:
> > 
> > A further warning to add to Doug's:  If you have any interest in running
> > the BP6 nodes with UDMA enabled, you CANNOT overclock.  I believe that
> > Mark Hahn noted that in a comment just yesterday or the day before.  You
> > will completely screw up critical timing if you do.
> > 
> > I personally run my BP6's at their nice, pedestrian spec speed and they
> > are very nice and reliable.
> I am running mine at its pedestrian spec as well but I'm seeing lockups every
> 14 days or so. The machine locks hard so that even a magic SysRq doesn't 
> work.
> I am running a DFE-550TX network card and some low end video card as 
> additional cards on the motherboard. I've started to suspect the BIOS
> is shutting the machine down because of an overheat state -- should
> I be using something in Linux to monitor the internal temperature?

There are, ah, so many ways that this could happen.  Which kernel are you
using (presumably SMP)?  Are there any messages?  I was rock solid with
2.2.13 and did fine with 2.2.14-5.0smp (Red Hat 6.2), but I went to
2.2.15 when I rebuilt a kernel to use with lm_sensors (see below) AND
have been trying to use ide-scsi to run my CD-RW drive.  Something in
that combination is poisonous, and I've locked up a couple of times.

The point being that sometimes particular kernels, or particular kernels
configured in certain ways, can be unstable on a given hardware
combination.  Your two responses in this case are to:

   a) Participate in the debugging.  Join the kernel-smp list, read its
archives and FAQ and see if you can help debug whatever it is on your
system that is broken.  This is the "right way" to proceed, but
not everybody has the time or skill to do it.

   b) Try different kernels and/or hardware combos to see if it
stabilizes.  Hardware is so cheap these days that it is often less of a
hassle to simply swap out anything suspect until things clear.  I
personally would suspect your network card first (you might try a $30
Netgear F310, which works fairly well for me), but it might well be your
kernel revision per se.  I'd heard Bad Things about 2.2.15 before I
tried to install it; I'm likely going to fall back to 2.2.14 myself (or
back to my known-happy 2.2.13).

Finally, your point about overheating is well taken, although it is part
of the larger context of b).  The cruel fact of the matter is that cheap
COTS hardware has a relatively high failure rate, and that not all
failures are obvious "puff of smoke"/"won't boot at all" types of
occurrences.  I've had an Abit system that "just wasn't right".  It'd
boot all right (usually) and once booted it would usually run, but then
a few days or a week later the system would just crash, and when it did I
might or might not successfully reboot.  The kind of thing you'd pull
your hair out about and rave against Linux for -- if it weren't for the
fact that I had an identical box with identical hardware humming right
along sitting next to it with nary a problem from the day I installed it
to now.

I played swap-the-component (fortunately I have a pretty large stack of
extra video cards, NICs, and so forth to play with) and it turned out
that it was the AGP video card.  Returned it to the store, put in a PCI
one (I know, AGP probably would work fine, but I've had the damnedest --
bad -- luck with AGP cards) and the system is like a rock.

To investigate your hypothesis about temperature, you can download and
install the lm_sensors package from http://www.netroedge.com/~lm78.
This will (unfortunately) require that you build a kernel -- the
lm_sensors package won't build as a standalone module without access to
the include files built by your actual running kernel.  This makes
installation into a RH system a bit of a pain.  If you persevere and do
this, you can use the included "sensors" command to get snapshots of the
system temperatures and voltages (where the latter are also important --
they can indicate that your power supply is failing or inadequate, and
this is actually one of the most likely causes of hardware failure).  Or
you can use procstatd/watchman to monitor temperatures and fans (not
voltages, sorry) in real time to see if temps are indeed creeping up
prior to system death.  I can guarantee that procstatd works with
lm_sensors 2.5.0 on an Abit BP6 (since I have one) and some other Abit
motherboards, although there are still lots of lm_sensors based systems
that it hasn't been tested on and may bomb on.
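
As a rough illustration of watching for temps creeping up before a
lockup, here is a minimal Python sketch.  It is not part of lm_sensors or
procstatd; it only assumes that a working "sensors" command is on the
PATH.  The names log_snapshot and monitor, and the log file name in the
usage note, are made up for this example.  Each snapshot is flushed and
fsync'ed so that the last reading before a hard lock has a fair chance of
surviving on disk.

```python
import os
import subprocess
import time


def log_snapshot(log_path, command=("sensors",)):
    """Append one timestamped snapshot of `command` output to log_path.

    `command` defaults to the lm_sensors "sensors" tool (assumed to be
    installed); any other command can be passed in for testing.
    """
    try:
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=30
        )
        body = result.stdout or result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure instead of dying, so the log shows the gap.
        body = f"could not run {command[0]}: {exc}\n"

    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a") as log:
        log.write(f"--- {stamp} ---\n{body}")
        if not body.endswith("\n"):
            log.write("\n")
        # Push the data to disk now; after a hard lock there is no later.
        log.flush()
        os.fsync(log.fileno())
    return body


def monitor(log_path, interval_s=60, iterations=None, command=("sensors",)):
    """Log a snapshot every interval_s seconds.

    Runs forever when iterations is None, otherwise stops after that
    many snapshots.
    """
    count = 0
    while iterations is None or count < iterations:
        log_snapshot(log_path, command)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)
```

To use it, call something like monitor("sensors.log") from a small
script left running in the background, then read the tail of the log
after the next lockup and compare the last few temperature and voltage
readings against earlier ones.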



Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
