[Beowulf] Node Drop-Off

Chris Samuel csamuel at vpac.org
Mon Nov 13 05:36:26 PST 2006

On Sunday 12 November 2006 16:13, Tim Moore wrote:

> Has anyone ever seen such behavior?

Others have mentioned attaching consoles, etc., but it's also worth 
trawling through the logs in /var/log to see if anything is showing up there.

Check dmesg whilst the node is under load. If you're seeing machine check 
problems, ECC parity problems or SCSI errors, you might catch them then 
(though they should be in the logs too).
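
A rough sketch of that kind of check, in Python. The function name and the 
patterns are illustrative assumptions; real kernel messages vary by kernel 
version and driver, so the list would need tuning for the node in question.

```python
import re

# Assumed, non-exhaustive patterns for the error classes mentioned above.
SUSPECT = re.compile(
    r"machine check|\bmce\b|\becc\b|parity|scsi.*error|i/o error",
    re.IGNORECASE,
)

def suspicious_lines(log_text):
    """Return the lines of log_text that match any SUSPECT pattern."""
    return [line for line in log_text.splitlines() if SUSPECT.search(line)]
```

Feed it the text of dmesg or of a file under /var/log, for example the 
stdout of subprocess.run(["dmesg"], capture_output=True, text=True), and 
print whatever comes back.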

If the node supports IPMI, try using that to get at any hardware logs. If 
you use Ganglia to monitor the cluster, have a look at that too and see if 
anything there suggests that a user space program is causing it.

I know users shouldn't be able to crash nodes, but we have seen it happen on 
some kernels where the OOM killer is not very good at getting things right and 
the machine deadlocks when a user's program runs it out of RAM.
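
If memory exhaustion is the suspect, logging free RAM and swap at intervals 
gives you something to look at after the node has gone. A minimal sketch in 
Python follows; the function names and the 64 MB threshold are assumptions 
for illustration, and MemFree ignores reclaimable cache, so treat it as a 
crude signal only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info

def memory_pressure(info, min_free_kb=65536):
    """True if MemFree plus SwapFree is below the assumed threshold."""
    return info.get("MemFree", 0) + info.get("SwapFree", 0) < min_free_kb
```

Read /proc/meminfo on the node every few seconds, pass the contents to 
parse_meminfo(), and write a timestamped line to a log on another machine 
whenever memory_pressure() returns True.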

Another possibility is bad blocks in the swap partition, which might only show 
up in low memory conditions (yes, using swap is bad, but people write bad 
code too) and corrupt something essential that's been paged out.

What does uname -a say on the box?

