[Beowulf] Tyan 2466 crashes, no obvious reason why
Joshua Baker-LePain
jlb17 at duke.edu
Fri Sep 3 12:01:30 PDT 2004
On Fri, 3 Sep 2004 at 11:31am, David Mathog wrote:
> One of 20 identical nodes containing
>
> Tyan 2466
> Single Athlon MP 2200+
> 1GB ECC memory
>
> is starting to flake out.
>
> For no apparent reason it just drops dead (as far as
> linux is concerned) after a few minutes to a few days.
> At that point the network is down, the serial lines are down,
> and near as I can tell the OS just blew up. There is zip,
> nothing, nada in the log files to indicate a problem.
I haven't put the time into this yet that you have, so this is more of a
"me too" than anything else. But, FWIW, I have 12 similar nodes, and have
had several of them start doing this. When the first one died, I swapped
its RAM with a "known good" node, and that worked for a while (that is,
the problem followed the RAM, so I thought I'd found the culprit). But,
eventually, it started happening to the original node again. And then
another. And then...
One thing I'd say is that 10 minutes' worth of memtest86 (or, better yet,
memtest86+) is not enough to catch an intermittent fault. Run it over the
weekend and see if it catches anything.
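To put rough numbers on why a longer run helps, here is a minimal sketch. It
assumes, purely for illustration, that an intermittent fault shows up
independently with the same fixed probability on every full test pass. The
per-pass probability and the pass counts below are made-up values, not
measurements from these nodes or from memtest86 itself.

```python
def detection_probability(p_per_pass: float, passes: int) -> float:
    """Chance that at least one of `passes` independent passes hits the fault.

    Assumes each pass catches the fault with the same probability
    `p_per_pass`, independently of the others (a simplification).
    """
    if not 0.0 <= p_per_pass <= 1.0:
        raise ValueError("p_per_pass must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must be non-negative")
    # P(at least one hit) = 1 - P(every pass misses)
    return 1.0 - (1.0 - p_per_pass) ** passes


if __name__ == "__main__":
    p = 0.01  # hypothetical: fault trips on 1% of passes
    # Hypothetical pass counts for runs of different lengths.
    for label, passes in [("short run", 1), ("overnight", 20), ("weekend", 100)]:
        chance = detection_probability(p, passes)
        print(f"{label:>10}: {passes:4d} passes -> {chance:.1%} chance of a hit")
```

Under those assumptions a single pass almost always comes back clean, while
many passes make at least one failure much more likely.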
Good luck.
--
Joshua Baker-LePain
Department of Biomedical Engineering
Duke University