[Beowulf] compute node reboots with bproc/beo tools
vipuld at gmail.com
Mon Aug 30 22:03:06 PDT 2004
Thanks Glen. But the kernel is not configured to reboot on panic; in
fact there is no panic or oops. The compute node just reboots, and
mostly the only trace left is that of "bpsh <nodenum> --stdout
/dev/console cat" on the master node. There is ample free memory on all
the nodes, and yes, the same kernel image and initrd are provided to
all the compute nodes, which run out of a ramdisk-based root fs.
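For anyone wanting to check the same thing, the reboot-on-panic setting on a running Linux node can be read from /proc/sys/kernel/panic. The sketch below is only illustrative: the helper name describe_panic_setting is hypothetical, and it assumes a Linux kernel that exposes the kernel.panic sysctl, where 0 means no automatic reboot and a positive value is the delay in seconds before rebooting.

```python
from pathlib import Path

# Standard Linux sysctl file holding the reboot-on-panic timeout.
PANIC_SYSCTL = Path("/proc/sys/kernel/panic")


def describe_panic_setting(raw: str) -> str:
    """Interpret the contents of /proc/sys/kernel/panic (hypothetical helper)."""
    timeout = int(raw.strip())
    if timeout == 0:
        return "no automatic reboot on panic"
    if timeout > 0:
        return f"reboot {timeout} s after a panic"
    # Negative values mean the kernel reboots without waiting.
    return "reboot immediately on panic"


if __name__ == "__main__":
    if PANIC_SYSCTL.exists():
        print(describe_panic_setting(PANIC_SYSCTL.read_text()))
    else:
        print("kernel.panic sysctl not available on this system")
```

Run on the node itself, a result of "no automatic reboot on panic" corresponds to the case described above, where the kernel is not configured to reboot on panic.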
On Mon, 30 Aug 2004 08:06:11 -0400, Glen Gardner
<glen.gardner at verizon.net> wrote:
> The first thing that comes to mind is that the kernel might be
> configured to reboot on a panic, and the machines are rebooting after
> something in your software is causing a kernel panic. I've seen this
> behavior on my cluster (14 Mini-ITX boards, FreeBSD, MPICH) when I force
> a panic by running the machine completely out of resources. Of course,
> I've configured mine to do that in order to get out from under crashed
> nodes on a bad run. In general one or two nodes will panic first, then
> reboot. The job is thus terminated by brute force, sparing the rest of
> the nodes from a similar fate. Check the logs in the controlling node
> and in the offending nodes to see if you can get more information about
> what is going on.
> It sounds like you are having trouble with a loadable kernel module.
> I'm also wondering if all of your machines have the same kernel?
> V D wrote:
> >Hi folks,
> >I have a 5-node Beowulf cluster, with 4 identical "compute" nodes
> >(with IDE disk, VIA processor, etc.) & 1 "master" node (more RAM, more
> >powerful VIA processor), connected by an unmanaged ethernet switch.
> >However, if I use either Scyld 28cz7 (version 3.1.9 bproc) software or
> >ClusterMatic 4 (version 4.0.0pre3 of bproc) software and associated
> >beoserv/beoboot tools on the cluster (master node), only 2 of the 4
> >identical compute nodes come and stay up in the cluster. The other 2
> >nodes reboot every 2-6 minutes, either during node_up (apparently
> >during an insmod/bpsh of some module or library) or after coming up.
> >These 2 nodes stay up fine if I boot them with an on-disk Linux image
> >with networking enabled. However, as soon as I use beo tools to control the
> >booting from a "master" node, they have this strange reboot behavior,
> >and the master notices the lost connection soon after. The hardware
> >is relatively new (I guess in this case only the CPU, RAM and NIC
> >really matter), the BIOS RAM tests succeed every time, and the OS
> >images get downloaded via PXE/beoboot and boot the phase 2 image fine.
> >The strange thing is that it is always the same 2 physical compute
> >nodes that fail in this way under both software systems. I have
> >stripped down the config and fstab scripts for the compute nodes to
> >the bare minimum.
> >Has anyone seen such behavior before? Any hints on how to debug this
> >problem? Any help in turning my current 3-node cluster into the full
> >5-node cluster will be greatly appreciated!
> Glen E. Gardner, Jr.
> AMSAT MEMBER 10593