[Beowulf] Tyan 2466 crashes, no obvious reason why

David Mathog mathog at mendel.bio.caltech.edu
Sun Sep 5 13:32:06 PDT 2004

After a few more crashes with nothing in the log files, I ran a shell
script that logged all sensors readings to a file every 10 seconds.
When the node next crashed (6 hours after a restart), none of the
readings, be they voltage, RPM, or temperature, had changed
significantly in the run-up to the crash.
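
For anyone who wants to try the same kind of logging, here is a minimal
Python sketch. It is not the shell script that was actually run. It
assumes the lm_sensors "sensors" command is on the PATH, and the
function names and the log path in the usage line are made up for
illustration. Each record is flushed and fsync'd so that the last
readings before a hard crash have a chance of reaching the disk.

```python
import os
import subprocess
import time
from datetime import datetime


def read_sensors(command=("sensors",)):
    """Run the sensor command (assumed: lm_sensors' `sensors`) and return its text."""
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    return result.stdout


def log_readings(path, interval=10.0, count=None, command=("sensors",)):
    """Append a timestamped block of sensor output to `path` every `interval` seconds.

    With count=None the loop runs until the process is killed or the node dies.
    """
    done = 0
    while count is None or done < count:
        with open(path, "a") as log:
            log.write("=== " + datetime.now().isoformat() + "\n")
            log.write(read_sensors(command))
            # Push the record to disk now; a crash would lose buffered data.
            log.flush()
            os.fsync(log.fileno())
        done += 1
        if count is None or done < count:
            time.sleep(interval)


# Hypothetical usage: log every 10 seconds until the machine goes down.
# log_readings("/var/log/sensors_trace.log", interval=10.0)
```

Logging to a local file only helps if the filesystem survives the
crash; writing to an NFS mount or sending the records to another node
would be a reasonable variation.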

I would have expected that if the power supply or onboard
voltage regulator was flaking out it would most likely result
in noise showing up in the sensors readings - but it didn't.

This time I also left a monitor plugged into the node
and was greeted by this message on the down machine:

CPU 0:  Machine Check Exception: 000000000000004
Bank 0: e67aa00000000833 at 000000003f9c8688
Bank 1: f600200000000853 at 00000000001ab948

Kernel panic CPU context corrupt
In interrupt handler - not syncing

That message must be new, though: when I plugged in that monitor
the system had only recently crashed, and there was nothing on
the screen then.

The motherboard capacitors have all been visually inspected
and none of them are leaking, bulging, or otherwise showing
signs of failure.

memtest86 is running now (and will be for the next 36 hours or so), but
if it doesn't find anything, does the console error suggest
a region of memory to test more intensively, or a particular test
to run in memtest86?
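
As a starting point for reading the two Bank lines, here is a minimal
Python sketch that splits each status word into its flag bits and
error code. It assumes the architectural x86 machine check layout for
the per-bank status register (VAL in bit 63 down to PCC in bit 57,
error code in the low 16 bits); the function name decode_bank is made
up for illustration. Whether the value after "at" is a physical
address that maps onto a memtest86 range depends on the CPU and the
bank, so the MiB figure is only a hint.

```python
# Flag bits of a per-bank machine check status word
# (assumption: architectural x86 MCi_STATUS layout).
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # an earlier error was overwritten by this one
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register holds extra information
    58: "ADDRV",  # the address register holds a valid address
    57: "PCC",    # processor context corrupt
}


def decode_bank(status_hex, addr_hex):
    """Split one 'Bank N: <status> at <address>' pair into its fields."""
    status = int(status_hex, 16)
    address = int(addr_hex, 16)
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,                   # generic error code
        "model_specific_code": (status >> 16) & 0xFFFF,  # CPU-specific code
        "address": address,
        "address_mib": address / 2**20,                  # offset in MiB
    }


# The two banks from the console message above.
for name, status, addr in [
    ("Bank 0", "e67aa00000000833", "000000003f9c8688"),
    ("Bank 1", "f600200000000853", "00000000001ab948"),
]:
    info = decode_bank(status, addr)
    print(name, " ".join(info["flags"]),
          "code=0x%04x" % info["error_code"],
          "addr=%.1f MiB" % info["address_mib"])
```

Both banks decode with UC and PCC set, which is consistent with the
"CPU context corrupt" panic text.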

Looks like I'm going to need a bunch of spare parts for a "fun"
game of "swap components and wait for the crash"...


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

