[scyld-users] Cluster up - no action on slaves
Gregg Germain
saville at comcast.net
Sat Dec 8 09:16:24 PST 2007
Hi all,
I have the freeware version of SCYLD Beowulf up and running on a 5
node system. I've added the 4 slaves to the Master using Beosetup. The
slaves boot and the status monitor shows them as being up. I can ping
them using their IP address. I ran the beofdisk, beoboot-install, and
bpctl commands as instructed by SCYLD.
I have a number of questions, but basically I think all processes are
running on the Master and none on the slaves:
1) What are the node names of the slaves? Are they 0,1,2,3? Or are they
.0, .1, .2 and .3?
2) I can't ssh into a slave from the master - connection refused. Is
this normal?
Is there an account on each slave that I can log into? What would
its username and password be?
3) I ran a simple Hello World program (on the Master and two slaves),
using MPI calls (not BeoMPI) and I get the following output:
$ mpirun -np 3 HelloWorld
I am the Master! Rank 0, size 3, name localhost.localdomain
Rank 1, size 3, name .0
Rank 2, size 3, name .1
So things SEEM to be working. However the Beowulf Status Monitor
statistics portion of the Slave nodes never budge. Ok maybe the program
runs too quickly to get a reaction.
3) I run the program shown below. I don't have confidence that any
process is actually running on a slave. So I have the slave (rank > 0)
do an ifconfig and send the results to a file. I have it open the file
and extract the IP address, and send that back to the Master for
printing. I always get the Master's IP address - never the slaves:
the Master's IP address is 192.168.0.3
the slave's IP addresses are: 192.168.1.100
192.168.1.101
Program output:
I am the Master! Rank 0, size 3, name localhost.localdomain
Rank 1, size 3, name .0 Extracted IP address: inet addr:192.168.0.3 Bcast:192.168.0.255 Mask:255.255.255.0
Rank 2, size 3, name .1 Extracted IP address: inet addr:192.168.0.3 Bcast:192.168.0.255 Mask:255.255.255.0
//
// stand alone program to extract an IP address from an ifconfig call
//
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
//
// MPI includes
//
#include <mpi.h>
/*using namespace std;*/
int main(int argc, char **argv)
{
    int rank, size, partner;
    int namelen;
    char name[MPI_MAX_PROCESSOR_NAME];
    char greeting[sizeof(name) + 100];
    char IPline[sizeof(name) + 100] = "";
    char IPaddress[256];
    char *startstring, *startpos, *endpos;
    int cmpval;
    FILE *IPfile;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(name, &namelen);
    snprintf(greeting, sizeof(greeting),
             "Rank %d, size %d, name %s\n", rank, size, name);
    //
    // Now do the important stuff based upon rank
    //
    if (rank == 0)
    {
        snprintf(greeting, sizeof(greeting),
                 "I am the Master! Rank %d, size %d, name %s\n",
                 rank, size, name);
        fputs(greeting, stdout);
        // Collect one message from every other rank and print it
        for (partner = 1; partner < size; partner++)
        {
            MPI_Status stat;
            MPI_Recv(greeting,
                     sizeof(greeting),
                     MPI_BYTE,
                     partner,
                     1,
                     MPI_COMM_WORLD,
                     &stat);
            fputs(greeting, stdout);
        } /* end for(partner=1; partner<size; partner++) */
    } // end if(rank == 0)
    else // rank is NOT zero - you are a slave
    {
        system("/sbin/ifconfig > IP.txt");
        IPfile = fopen("IP.txt", "r");
        if (!IPfile)
        {
            snprintf(greeting, sizeof(greeting),
                     "\n ERROR - cannot find the file!\n");
            //return -1;
        }
        else
        {
            // Skip the first line; the second line of the ifconfig
            // output is the one that carries the "inet addr:" field
            fgets(IPline, 128, IPfile);
            fgets(IPline, 128, IPfile);
            startpos = strstr(IPline, "192.168");
            if (startpos == NULL)
                snprintf(greeting, sizeof(greeting),
                         "\n sorry didn't find the IP address\n");
            else
            {
                endpos = strstr(IPline, "Bcast");
                startstring = IPline;
                // snprintf keeps the message inside the greeting buffer
                snprintf(greeting, sizeof(greeting),
                         "\nRank %d, size %d, name %s Extracted IP address: %s\n",
                         rank, size, name, IPline);
            }
            // Close the file only when fopen succeeded
            fclose(IPfile);
        } // end if (!IPfile) else
        MPI_Send(greeting,
                 strlen(greeting) + 1,
                 MPI_BYTE,
                 0,
                 1,
                 MPI_COMM_WORLD);
    } /* end you are a slave */
    MPI_Finalize();
    exit(0);
} //end MAIN
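The program sends back the whole second line of the ifconfig output rather than the address alone. A minimal sketch of trimming that line down to the dotted address follows. It assumes the "inet addr:" layout seen in the output above, and `extract_inet_addr` is a hypothetical helper name, not part of the program as posted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the posted program): copies the dotted
 * address that follows "addr:" in one line of ifconfig output into out.
 * Assumes the "inet addr:a.b.c.d" layout shown in the program output.
 * Returns 0 on success, -1 if the marker is missing or out is too small. */
int extract_inet_addr(const char *line, char *out, size_t outlen)
{
    const char *marker = "addr:";
    const char *start;
    size_t len;

    assert(line != NULL && out != NULL);

    start = strstr(line, marker);
    if (start == NULL)
        return -1;
    start += strlen(marker);

    /* The address ends at the first character that is not a digit or a dot. */
    len = strspn(start, "0123456789.");
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

On a slave, the result could be placed in IPaddress and sent back in place of IPline, which would keep the message short enough to read at a glance.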
4) Lastly, I took the above program and inserted a 3 level, time wasting
loop (for all ranks > 0) which causes the program to take 24 seconds to
run. When I run it, the stats for the slaves in the Beo Status Monitor
never budge. The Master's stats fluctuate.
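The loop itself is not shown in this message. A minimal sketch of a three-level time-wasting loop of this kind, assuming a hypothetical function `waste_time` and a bound `n` that would have to be tuned by hand to reach a run time of about 24 seconds, could look like this:

```c
#include <assert.h>

/* Hypothetical stand-in for the time-wasting loop; the real loop bounds
 * were not posted. Runs n * n * n iterations and returns the count.
 * The volatile accumulator keeps the compiler from removing the loops. */
double waste_time(int n)
{
    volatile double sink = 0.0;
    int i, j, k;

    assert(n >= 0);

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                sink += 1.0;

    return sink;
}
```

Calling it only where rank > 0 keeps the load off rank 0, so any CPU activity it causes should show up wherever the other ranks actually execute.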
In short I think all the processes are running on the Master and none on
the slaves. Running "top" on the Master shows all 3 processes on the Master.
What is missing? Is there a networking step that I have to perform to
get all this to work? /etc/hosts shows only one line:
127.0.0.1 localhost.localdomain localhost
There's nothing in the SCYLD instructions that indicates any other setup
steps that I have to do.
Thanks for any help you can offer.
Gregg