[Beowulf] Tyan 2466 crashes, no obvious reason why

Joshua Baker-LePain
Fri Sep 3 12:01:30 PDT 2004

On Fri, 3 Sep 2004 at 11:31am, David Mathog wrote:

> One of 20 identical nodes containing
> Tyan 2466
> Single Athlon MP 2200+
> 1GB ECC memory
> is starting to flake out.
> For no apparent reason it just drops dead (as far as
> linux is concerned) after a few minutes to a few days.
> At that point the network is down, the serial lines are down,
> and near as I can tell the OS just blew up.  There is zip,
> nothing, nada in the log files to indicate a problem.

I haven't put the time into this yet that you have, so this is more of a
"me too" than anything else.  But, FWIW, I have 12 similar nodes, and have 
had several of them start doing this.  When the first one died, I swapped 
its RAM with a "known good" node, and that worked for a while (that is, 
the problem followed the RAM, so I thought I'd found the culprit).  But, 
eventually, it started happening to the original node again.  And then 
another.  And then...

One thing I'd say is that 10 min worth of memtest86 (or, better yet, 
memtest86+) is not enough.  Run it over the weekend and see if it catches
anything.

Good luck.

Joshua Baker-LePain
Department of Biomedical Engineering
Duke University

