[Beowulf] Node Drop-Off
Chris Samuel
csamuel at vpac.org
Mon Nov 13 05:36:26 PST 2006
On Sunday 12 November 2006 16:13, Tim Moore wrote:
> Has anyone ever seen such behavior?
Others have mentioned attaching consoles, etc., but it's also worth trawling
through the logs in /var/log to see if anything is showing up there too.
Check dmesg whilst the node is under load; if you're seeing machine check
problems, ECC parity problems or SCSI errors then you might catch them then
(though they should also be in the logs).
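If you'd rather not eyeball it, something along these lines could filter dmesg
or log output for the kinds of messages mentioned above. It's only a rough
sketch: the patterns are my guesses at typical wording (real kernel messages
vary by version and driver), and the names PATTERNS and scan_lines are made up
for illustration.

```python
import re

# Assumed, illustrative signatures only; adjust to what your kernel
# and drivers actually print.
PATTERNS = {
    "machine_check": re.compile(r"machine check|\bmce\b", re.IGNORECASE),
    "ecc": re.compile(r"\becc\b|\bedac\b|parity", re.IGNORECASE),
    "scsi": re.compile(r"scsi.*error", re.IGNORECASE),
}


def scan_lines(lines):
    """Group lines by the (hypothetical) error category they match."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

You could feed it the lines of dmesg output captured under load, or an open
file from /var/log, and look at which categories come back non-empty.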
If the node supports IPMI, try using that to get at any hardware logs. If you
use Ganglia to monitor the cluster, have a look at that too and see if
anything there points to a user space program as the cause.
I know users shouldn't be able to crash nodes, but we have seen that on some
kernels where the OOM killer is not very good at getting things right and the
machine deadlocks when a user's program runs it out of RAM.
Another possibility is bad blocks in the swap partition which might only show
up in low memory conditions (yes, using swap is bad, but people write bad
code too) and corrupt something essential that's been paged out.
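To find out which swap areas a node is actually using (and how much of each is
in use) before running a surface check on them, a small parser for the
contents of /proc/swaps might help. This is a minimal sketch that assumes the
usual five-column layout (Filename, Type, Size, Used, Priority); the function
name parse_proc_swaps and the dictionary keys are hypothetical.

```python
def parse_proc_swaps(text):
    """Return one dict per swap area listed in /proc/swaps-style text."""
    entries = []
    # The first line is assumed to be the column header.
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        entries.append({
            "device": fields[0],
            "type": fields[1],
            "size_kb": int(fields[2]),
            "used_kb": int(fields[3]),
            "priority": int(fields[4]),
        })
    return entries
```

Reading /proc/swaps on the node and passing its contents in gives you the
partitions worth checking; a non-zero used_kb around the time of a drop-off
would at least show that the node had started paging.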
What does uname -a say on the box?
cheers!
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia