[scyld-users] Cluster up - no action on slaves
Donald Becker
becker at scyld.com
Sat Dec 8 10:42:39 PST 2007
On Sat, 8 Dec 2007, Gregg Germain wrote:
> I have the freeware version of SCYLD Beowulf up and running on a 5
> node system. I've added the 4 slaves to the Master using Beosetup. The
> slaves boot and the status monitor shows them as being up. I can ping
> them using their IP address. I ran the beofdisk, beoboot-install, and
> bpctl commands as instructed by SCYLD.
>
> I have a number of questions, but basically I think all processes are
> running on the Master and none on the slaves:
> 1) What are the node names of the slaves? Are they 0,1,2,3? Or are they
> .0, .1, .2 and .3?
The slave nodes are permanently numbered, starting with 0. The ".23"
names are interpreted by the Scyld tools and the BeoNSS name service.
The first time a node boots to the point where it's a usable slave node,
it's assigned a node number. That node number is associated with the
network MAC address and written to the /etc/beowulf/config file. It
stays the same until it's manually changed in the config file. (The
'beosetup' program changes the configuration file and then signals the
booting and BProc subsystems so that they immediately notice the changes.)
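For reference, the node assignments in /etc/beowulf/config are plain
text, one 'node' line per MAC address, with the node number implied by
the order of the lines. A rough sketch (the MAC addresses here are made
up, and the exact directives vary by release):

  node 00:50:8B:D2:B1:70
  node 00:50:8B:D2:B1:71

Reordering or editing these lines is how you would manually renumber
nodes.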
Most of our tools accept just the node number, but other tools and
applications expect host names. The BeoNSS subsystem is a plug-in name
service that translates ".23" into the proper IP address.
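For example, once BeoNSS is active in the name service switch (the
standard install normally arranges this), any program that looks up
host names can use the dot names directly:

  ping .0
  ping .3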
Over time BeoNSS has evolved to accept a wider range of node names.
Older versions only accepted ".23", "cluster.23" and "23.cluster", and
returned ".23" as the only name on reverse look-ups.
New versions allow specifying the node name format in the config file so
you can name your nodes "myclusternode23", and that will be returned as
the preferred host name.
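As an illustrative sketch only (the directive name and substitution
syntax here are hypothetical; check the config file documentation for
your release), the idea is a single line in /etc/beowulf/config such
as:

  nodename myclusternode%N

after which node 23 would be returned as 'myclusternode23' on
look-ups.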
The 'hostname' part of BeoNSS relies on the fact that the Scyld boot
system assigns nodes sequential IP addresses. To calculate any node's
IP address it needs only the IP address of node '0' (changing this
triggers a cluster reboot) and the maximum node number (updated
consistently in the rare case that it changes). Not only is this a
consistency
and reliability improvement over the traditional approaches, it's a
major performance win when establishing all-to-all communication on
large clusters.
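As a concrete example with a made-up address range: if node '0' is
assigned 192.168.1.100, then node 23's address is simply 192.168.1.100
+ 23 = 192.168.1.123. Any process can compute any node's address
locally, with no lookup table and no name-server round trip.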
As a convenience, we have a few special cases. The current master has a
number and nickname of '.-1'. The current node is '.-2'. This allows
you to store node numbers as integers without special-casing the output.
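For example (a sketch, assuming 'bpsh' accepts the same dot names shown
below):

  bpsh .-1 uname -n    # '.-1' always means the current master
  bpsh .0 uname -n     # slave node 0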
> 2) I can't ssh into a slave from the master - connection refused. Is
> this normal?
Yes, it's the expected behavior, and it's part of the most interesting
aspect of the Scyld cluster architecture.
On a Scyld system, instead of using 'ssh' or 'rsh', you start an
interactive shell on the slave node just as you would start any
application:
bpsh .23 /bin/sh -i
You will automatically be using the master's current version of
/bin/sh and your current environment variables, and you will be placed
in your current working directory (or '/' if that path doesn't exist
on the slave node).
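The same mechanism works for one-off commands, so a quick way to
convince yourself that a process really runs on a slave node is to
read something node-local, for instance:

  bpsh 0 cat /proc/loadavg
  bpsh 0 cat /proc/cpuinfo

The 'cat' binary and your environment come from the master, but /proc
is the slave's own.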
We call them "slave nodes" instead of "compute nodes" in the basic
configuration because they are directly controlled by a master node.
One aspect of this is that basic slave nodes have no daemons or
services running on them. They are running only the programs the
master started. This results in a very fast boot and a lightweight
environment, leaving almost all of the memory free and a clean
environment for applications to run in.
Having a very simple slave node environment doesn't limit what you can
do or run on the slave node, but it does mean you have to change your
perspective slightly, and sometimes create a more traditional
environment for some applications.
Starting an interactive shell on the compute node is actually a
simpler and more natural solution if you aren't already experienced
with 'ssh' or 'rsh'. Traditionally you would need to create an account
on the
compute node (perhaps by copying out /etc/passwd and /etc/group), make
certain that you have a home directory, and make certain that you have
'rsh' or 'ssh' configured (which can be quite tricky from scratch).
And when you do log into the compute node using 'ssh', you may not have
the environment you expect. The user shell might be different, your
environment variables are probably not the same, and any environment
variables you set interactively are certainly not the same.
I could continue with all of the other reasons why this is a much better
approach (security, no scheduling interference, consistency) if you are
interested.
> 3) I ran a simple Hello World program (on the Master and two slaves),
> using MPI calls (not BeoMPI) and I get the following output:
>
> $ mpirun -np 3 HelloWorld
> I am the Master! Rank 0, size 3, name localhost.localdomain
> Rank 1, size 3, name .0
> Rank 2, size 3, name .1
> So things SEEM to be working. However the Beowulf Status Monitor
> statistics portion of the Slave nodes never budge. Ok maybe the program
> runs too quickly to get a reaction.
It's likely that you are not seeing anything on the display because your
program is so trivial.
Compounding that, the display tool, Beostatus, defaults to a 5-second
display update period. Beostat reports from each node once per second,
so it's best to change the Beostatus update period to match, once per
second.
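If you want to see the slave statistics visibly move, give a node a
few seconds of sustained work first. A crude sketch:

  bpsh 1 dd if=/dev/zero of=/dev/null bs=1M count=50000

That should register as CPU activity on node 1 in the status display.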
'Beostat' is the name of the subsystem that gathers per-node state,
status and statistics. It's also the name of the user-level program to
display some of those statistics. Each node sends a single
consolidated performance report once per second, and a 'recvstat'
process on the master writes each report into a shared memory
region. Any program on the master can read this memory, so you can have
many schedulers, display tools and mappers running without increasing
the load on the compute nodes.
> 4) I run the program shown below. I don't have confidence that any
> process is actually running on a slave. So I have the slave (rank > 0)
> do an ifconfig and send the results to a file. I have it open the file
> and extract the IP address, and send that back to the Master for
> printing. I always get the Master's IP address - never the slaves:
The run above certainly spread the processes out over the cluster. This
run didn't seem to.
You can test how a job will be 'mapped' (spread out over the cluster) by
running the 'beomap' program. This calls the same mapping function
that MPI will use and shows the resulting map.
With four single-core nodes the output will likely look like:
prompt> beomap -np 4
-1:0:1:2
prompt> beomap -np 4 -no-local
0:1:2:3
prompt> export NP=4 NO_LOCAL=1
prompt> beomap
0:1:2:3
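Since mpirun consults the same mapper, the environment variables carry
over as well; a sketch (assuming your mpirun honors the same mapper
variables, as ours does):

  prompt> NO_LOCAL=1 mpirun -np 4 HelloWorld

should place rank 0 on node 0 rather than on the master.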
--
Donald Becker becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com www.scyld.com
Annapolis MD and San Francisco CA