BEOWULF cluster hangs
dtj at uberh4x0r.org
Thu Sep 26 08:26:46 PDT 2002
You might also check what the network is doing.
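As a rough way to do that on a Linux node, the sketch below samples the kernel's per-interface counters in /proc/net/dev twice and reports the difference. The function names and the choice of counters are my own assumptions, not something from the original report; it only shows whether traffic stalls or errors climb while the job appears hung.

```python
import time


def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {interface: counters}."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:  # 8 receive columns followed by 8 transmit columns
            continue
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_errs": int(fields[2]),
            "tx_bytes": int(fields[8]),
            "tx_errs": int(fields[10]),
        }
    return stats


def counter_delta(before, after):
    """Per-interface change in each counter between two parsed samples."""
    return {
        iface: {key: after[iface][key] - before[iface][key] for key in after[iface]}
        for iface in after
        if iface in before
    }


def sample(interval_seconds=5.0, path="/proc/net/dev"):
    """Read the counters twice, interval_seconds apart (Linux only)."""
    with open(path) as handle:
        before = parse_net_dev(handle.read())
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = parse_net_dev(handle.read())
    return counter_delta(before, after)
```

Calling sample() on each node while the job looks stuck gives a quick picture: byte counts near zero everywhere suggest the nodes are not talking at all, while rising error counts point at the network rather than the application.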
It may also be an issue with your application. I know of a molecular
dynamics app (NAMD) with a pathological case that exhibits similar
behaviour. After many hours of running, the one simulation that had
the problem would cause the application to do progressively more
"housekeeping", and CPU utilization would drop greatly over time. If
you killed and resumed it, the utilization would be back up to where
it should be, but after many hours it would start having problems
again. That particular problem was "fixed" (pronounced "big kludge")
by an iterative script that killed and resumed the run before it could
get into that state.
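I don't have that script, so what follows is only a minimal sketch of the same idea, not the actual workaround. It runs a job in bounded segments and restarts it before it has had time to degrade. The function name, the segment length and the command are all hypothetical, and it assumes the job picks up from its own checkpoint or restart files when it is launched again.

```python
import subprocess
import sys


def run_in_segments(cmd, segment_seconds, max_segments):
    """Run cmd in time-limited segments, restarting it after each timeout.

    Returns (segments_started, returncode). returncode is None if the job
    was still being restarted when max_segments ran out.
    Assumption: cmd resumes from its own checkpoint when started again.
    """
    for segment in range(1, max_segments + 1):
        proc = subprocess.Popen(cmd)
        try:
            # If the job finishes by itself within the segment, we are done.
            return segment, proc.wait(timeout=segment_seconds)
        except subprocess.TimeoutExpired:
            # Segment is up: stop the job politely, then forcibly if needed.
            proc.terminate()
            try:
                proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
    return max_segments, None


# Hypothetical usage: restart a long simulation every four hours.
# run_in_segments(["./run_simulation.sh"], 4 * 3600, max_segments=50)
```

A fixed timer is the crudest possible trigger. Watching CPU utilization and restarting only when it drops would be closer to the symptom, but it is also more to get wrong.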