Problems with dual Athlons
Mark Hahn
hahn at physics.mcmaster.ca
Wed Jul 31 08:15:12 PDT 2002
> We have 4 dual athlon systems running kernel 2.4.18 (gcc 2.95.2).
it's definitely worth your while to try a more recent kernel
(eg, 2.4.19-rc3, or possibly the latest ac or aa version.)
> Two of them crash frequently and the other two run fine.
> We have tried to replace different hardware components and deactivate
> the SMP option but the problem persists.
how seriously have you addressed the hardware explanation?
for instance, have you verified that the CPU fans are mounted
properly? is there any temperature correlation to when the crashes
happen? do you have a reason to believe the dimms are good?
how about bios settings (esp wrt memory timings) and/or bios versions?
how about power supplies? it's useful to keep a known-good "monster"
450W PS from a name brand like Enermax around, one that you really
only use for testing.
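
one way to check for a temperature correlation is to log readings
often enough that the last line written before a hang shows how hot
the box was. the following is only a minimal sketch: `log_reading`,
`last_reading_before`, the log format and the `read_temp_c` callback
are all hypothetical names, and how you actually read the sensor
depends on your board and monitoring setup.

```python
import os
import time


def log_reading(path, read_temp_c, now=time.time):
    """Append one 'epoch-seconds<TAB>temperature' line and force it to disk.

    read_temp_c is a caller-supplied function (an assumption here) that
    returns the current CPU temperature in degrees Celsius.
    """
    line = "%d\t%.1f\n" % (now(), read_temp_c())
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # push the line to disk so it survives a hard hang
    return line


def last_reading_before(path, crash_time):
    """Return (timestamp, temp_c) of the newest reading at or before crash_time."""
    best = None
    with open(path) as f:
        for raw in f:
            parts = raw.split("\t")
            if len(parts) != 2:
                continue  # skip a malformed or truncated line
            try:
                stamp, temp = int(parts[0]), float(parts[1])
            except ValueError:
                continue
            if stamp <= crash_time and (best is None or stamp > best[0]):
                best = (stamp, temp)
    return best
```

call `log_reading` every few seconds from a loop, then after each
reset compare `last_reading_before(path, reset_time)` across several
crashes to see whether they cluster at high temperatures.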
> The main difference between them is that the systems that crash
> (the servers) have two network interfaces while the systems that run
> fine (normal nodes) have only one network interface.
bonding? incidentally, are you using MPS 1.4 and kernel apic support?
> Can this be the cause of the problem ? Would it be a good idea to use
> another version of gcc ?
2.95.2 is still recommended for 2.4 I believe. I recall AC saying that
it had some trouble with 2.5 though.
> The motherboard is an ASUS A7M266-D. One of the systems that
> crashes is running Debian 2.1 and the other Debian 2.2. The systems
> that don't crash run Debian 2.1.
I don't see why userspace would matter.
> "Crash" here means that the VGA display is blank and the system has to
> be reset. There is no other relevant message.
consider first turning off the blanking console screensaver,
and possibly running a serial console for logging purposes.
I assume you mean that magic-sysrq doesn't work either.
in my experience that normally implicates system-level HW problems
(heat, power, etc.)
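
for the serial console logging, a second machine can capture and
timestamp whatever the crashing box prints last. a rough sketch,
assuming the crashing machine already sends its console to a serial
port and the port on the logging machine has been set to the matching
speed separately; `timestamp_lines` and `capture` are hypothetical
names, and the device path is whatever your logging machine uses.

```python
import sys
import time


def timestamp_lines(stream, out, now=time.time):
    """Copy lines from a binary stream to out, prefixing each with epoch seconds."""
    count = 0
    for raw in stream:
        # console output may contain stray bytes; replace anything non-ASCII
        text = raw.decode("ascii", "replace").rstrip("\r\n")
        out.write("[%d] %s\n" % (now(), text))
        out.flush()  # keep the log current in case the logger is interrupted
        count += 1
    return count


def capture(device, out=sys.stdout):
    """Read console output from a serial device until it is closed."""
    # unbuffered, so each line is handled as soon as it arrives
    with open(device, "rb", buffering=0) as stream:
        return timestamp_lines(stream, out)
```

run `capture` with your serial device and redirect the output to a
file; the last timestamped lines before a hang are the ones to look at.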
regards, mark hahn.