[Beowulf] Tyan 2466 crashes, no obvious reason why
mathog at mendel.bio.caltech.edu
Fri Sep 3 11:31:30 PDT 2004
One of 20 identical nodes, each containing a single Athlon MP 2200+
and 1GB of ECC memory, is starting to flake out.
For no apparent reason it just drops dead (as far as
Linux is concerned) after a few minutes to a few days.
At that point the network is down, the serial lines are down,
and as near as I can tell the OS just blew up. There is zip,
nothing, nada in the log files to indicate a problem.
I pulled the unit and monitored it closely and it does not seem
to be an overheating problem: all the fans are spinning
as they should be even after it has crashed. The network
port lights are still flashing. After reboot smartctl shows
no errors on the hard drive. Running sensors every few seconds
in a loop shows nothing odd happening to the voltages or temps
or fan speeds up through the last log point before it dies.
Running memtest86 for 10 minutes showed no memory errors.
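
For anyone who wants to repeat the sensors loop, here is a minimal sketch of that kind of logger, not the exact script used on this node. It assumes Python 3.7 or later and that the lm-sensors `sensors` command is on the PATH; the function name `log_samples` and the log file name are made up for illustration. Each sample is flushed and fsync'ed so that the last reading before a hard hang has a good chance of reaching the disk.

```python
import os
import subprocess
import time


def log_samples(command, log_path, interval_s=5.0, max_samples=None):
    """Append timestamped output of `command` to `log_path`.

    Runs until `max_samples` samples are written, or forever if it is None.
    `log_samples` is a hypothetical helper, not part of lm-sensors.
    """
    count = 0
    with open(log_path, "a") as log:
        while max_samples is None or count < max_samples:
            # Run the monitoring command and capture whatever it prints.
            result = subprocess.run(command, capture_output=True, text=True)
            log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')} ===\n")
            log.write(result.stdout)
            if result.stderr:
                log.write(result.stderr)
            # Push the sample out of Python's buffer and ask the kernel to
            # write it to disk, so it is not lost if the machine locks up.
            log.flush()
            os.fsync(log.fileno())
            count += 1
            if max_samples is None or count < max_samples:
                time.sleep(interval_s)
    return count


# Assumed usage on the suspect node (runs until the box dies or is stopped):
# log_samples(["sensors"], "sensors.log", interval_s=5.0)
```

After a crash and reboot, the tail of the log file holds the last voltages, temperatures and fan speeds recorded before the hang.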
I'm thinking about replacing the power supply (for lack of a better
idea). What else might be causing this? There's not much
in these systems, just the one CPU, a floppy, 1GB RAM and a cheap
S3 graphics card (normally not used).
The other 19 (identical) nodes are working reliably.
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech