[Beowulf] compute node reboots with bproc/beo tools
Glen.Gardner at verizon.net
Mon Aug 30 05:06:11 PDT 2004
The first thing that comes to mind is that the kernel might be
configured to reboot on a panic, and the machines are rebooting because
something in your software is causing a kernel panic. I've seen this
behavior on my cluster (14 Mini-ITX boards, FreeBSD, MPICH) when I force
a panic by running a machine completely out of resources. Of course,
I've configured mine to do that in order to get out from under crashed
nodes on a bad run. In general one or two nodes will panic first, then
reboot. The job is thus terminated by brute force, sparing the rest of
the nodes from a similar fate. Check the logs on the controlling node
and on the offending nodes to see if you can get more information about
what is going on.
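As a rough sketch of both checks, assuming the nodes run Linux and
expose the panic timeout in /proc/sys/kernel/panic (0 normally means no
automatic reboot, a positive value means reboot after that many
seconds), something like the following could be used. The function
names and the log markers are my own choices for illustration, not part
of any bproc or beoboot tool.

```python
from pathlib import Path
from typing import List


def describe_panic_setting(raw: str) -> str:
    """Interpret the text read from /proc/sys/kernel/panic (Linux)."""
    timeout = int(raw.strip())
    if timeout == 0:
        return "no automatic reboot on panic"
    if timeout < 0:
        # Assumption: newer kernels treat a negative value as "reboot at once".
        return "reboot immediately on panic"
    return f"reboot {timeout} s after a panic"


def find_panic_lines(log_text: str) -> List[str]:
    """Return log lines that mention a kernel panic or an oops."""
    markers = ("kernel panic", "oops")  # illustrative markers only
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in markers)
    ]


if __name__ == "__main__":
    panic_file = Path("/proc/sys/kernel/panic")
    if panic_file.exists():
        print(describe_panic_setting(panic_file.read_text()))
```

Run the first check on a node that stays up, and feed the second one
whatever console or syslog output you manage to capture from a node
that reboots.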
It sounds like you are having trouble with a loadable kernel module.
I'm also wondering whether all of your machines are running the same kernel?
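One way to check the kernel question, sketched here under the
assumption that you can collect the output of "uname -r" from each node
by whatever means you have (the node names and release strings below
are made up for the example):

```python
from collections import defaultdict
from typing import Dict, List


def group_by_kernel(releases: Dict[str, str]) -> Dict[str, List[str]]:
    """Group node names by the kernel release string each one reported."""
    groups = defaultdict(list)
    for node, release in sorted(releases.items()):
        groups[release.strip()].append(node)
    return dict(groups)


def kernels_match(releases: Dict[str, str]) -> bool:
    """True when every node reported the same kernel release."""
    return len(group_by_kernel(releases)) <= 1


if __name__ == "__main__":
    # Hypothetical data: node name -> output of "uname -r" on that node.
    reported = {
        "node0": "2.4.20-a\n",
        "node1": "2.4.20-a\n",
        "node2": "2.4.20-b\n",
    }
    for release, nodes in group_by_kernel(reported).items():
        print(release, nodes)
    print("all the same:", kernels_match(reported))
```

If the two misbehaving nodes land in a group of their own, a module
built for a different kernel becomes a much more likely suspect.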
V D wrote:
>I have a 5-node Beowulf cluster, with 4 identical "compute" nodes
>(with IDE disk, VIA processor, etc.) & 1 "master" node (more RAM, more
>powerful VIA processor), connected by an unmanaged ethernet switch.
>However, if I use either Scyld 28cz7 (version 3.1.9 bproc) software or
>ClusterMatic 4 (version 4.0.0pre3 of bproc) software and associated
>beoserv/beoboot tools on the cluster (master node), only 2 of the 4
>identical compute nodes come and stay up in the cluster. The other 2
>nodes reboot every 2-6 minutes, either during node_up (apparently
>while insmod/bpsh of some module/library) or after coming up. These 2
>nodes stay up fine if I boot them up with on-disk Linux image with
>networking enabled. However, as soon as I use beo tools to control the
>booting from a "master" node, they have this strange reboot behavior,
>and the master realizes the lost connection soon after. The hardware
>is relatively new (I guess in this case only CPU, RAM and NIC really
>matter), the BIOS RAM tests succeed every time, the OS images get
>downloaded via PXE/beoboot and boot phase 2 image fine; but the
>strange thing is that it is always the same 2 physical compute nodes
>that fail in this way under both software systems. I have stripped
>down the config and fstab scripts for the compute nodes to bare
>minimums.
>Has anyone seen such behavior before? Any hints on how to debug this
>problem? Any help will be greatly appreciated to convert my current
>3-node into the maximum 5-node cluster!
Glen E. Gardner, Jr.
AMSAT MEMBER 10593