[scyld-users] Cluster up - no action on slaves

Gregg Germain saville at comcast.net
Sat Dec 8 09:16:24 PST 2007


Hi all,

  I have the freeware version of SCYLD Beowulf up and running on a 5 
node system. I've added the 4 slaves to the Master using Beosetup. The 
slaves boot and the status monitor shows them as being up. I can ping 
them using their IP address. I ran the beofdisk, beoboot-install, and 
bpctl commands as instructed by SCYLD.

I have a number of questions, but basically I think all processes are 
running on the Master and none on the slaves:

1) What are the node names of the slaves? Are they 0,1,2,3? Or are they 
.0, .1, .2 and .3?


2) I can't ssh into a slave from the master - connection refused. Is 
this normal?

     Is there an account on each slave that I can log into? What would 
its username and password be?


3)   I ran a simple Hello World program (on the Master and two slaves), 
using MPI calls (not BeoMPI) and I get the following output:

$ mpirun -np 3 HelloWorld
I am the Master! Rank 0, size 3, name localhost.localdomain
Rank 1, size 3, name .0
Rank 2, size 3, name .1

  So things SEEM to be working. However the Beowulf Status Monitor 
statistics portion of the Slave nodes never budge. Ok maybe the program 
runs too quickly to get a reaction.


3) I run the program shown below. I don't have confidence that any 
process is actually running on a slave. So I have the slave (rank > 0) 
do an ifconfig and send the results to a file. I have it open the file 
and extract the IP address, and send that back to the Master for 
printing.  I always get the Master's IP address - never a slave's:

The Master's IP address is 192.168.0.3
The slaves' IP addresses are: 192.168.1.100
                              192.168.1.101

Program output:

I am the Master! Rank 0, size 3, name localhost.localdomain

Rank 1, size 3, name .0 Extracted IP address:           inet addr:192.168.0.3  Bcast:192.168.0.255  Mask:255.255.255.0

Rank 2, size 3, name .1 Extracted IP address:           inet addr:192.168.0.3  Bcast:192.168.0.255  Mask:255.255.255.0

//
// stand alone program to extract an IP address from an ifconfig call
//
#include <stdio.h>
#include <string.h>

#include <stdlib.h>
#include <time.h>
#include <math.h>

//
// MPI includes
//

#include<mpi.h>

/*using namespace std;*/

int main(int argc, char **argv)
{

   int rank, size, partner;
   int namelen;
   char name[MPI_MAX_PROCESSOR_NAME];
   char greeting[sizeof(name) + 100];

   char IPline[sizeof(name) + 100];
   char IPaddress[256];
   char *startstring, *startpos, *endpos;

   int cmpval;

   FILE *IPfile;

   MPI_Init(&argc, &argv);

   MPI_Comm_size(MPI_COMM_WORLD, &size);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Get_processor_name(name, &namelen);
   sprintf(greeting, "Rank %d, size %d, name %s\n", rank,size,name);

   //
   // Now do the important stuff based upon rank
   //

       if(rank == 0)
	{
          sprintf(greeting, "I am the Master! Rank %d, size %d, name %s\n",
                  rank, size, name);
	  fputs(greeting, stdout);
	  for(partner=1; partner<size; partner++)
	    {
	      MPI_Status stat;
	      MPI_Recv(greeting,
		       sizeof(greeting),
		       MPI_BYTE,
		       partner,
		       1,
		       MPI_COMM_WORLD,
		       &stat);
	      fputs(greeting, stdout);
	    } /* end for(partner=1; partner<size; partner++) */

	} // end if(rank == 0)
       else // rank is NOT zero - you are a slave
	{
          system("/sbin/ifconfig > IP.txt");
          IPfile = fopen("IP.txt", "r");
            if (!IPfile)
             {
              sprintf(greeting, "\n ERROR - cannot find the file!\n");
              //return -1;
             }
            else
             {
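              // Line 1 of the ifconfig output is the interface header;
              // line 2 carries the "inet addr:" field we want.
              IPline[0] = '\0';   // stay a valid string if a read fails
              fgets(IPline, 128, IPfile);
              fgets(IPline, 128, IPfile);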

              startpos = strstr(IPline, "192.168");
               if( startpos == NULL)
	        sprintf(greeting, "\n sorry didn't find the IP address\n");
               else
	       {
	        endpos= strstr(IPline, "Bcast" );
                 startstring = IPline;
                 // snprintf keeps a long name plus IPline from overrunning greeting
                 snprintf(greeting, sizeof(greeting), "\nRank %d, size %d, name %s "
                          "Extracted IP address: %s\n", rank, size, name, IPline);



	       }

              } // end if (!IPfile) else


          if (IPfile)   // fopen may have failed above; fclose(NULL) is undefined
            fclose(IPfile);

	  MPI_Send(greeting,
		   strlen(greeting)+1,
		   MPI_BYTE,
		   0,
		   1,
		   MPI_COMM_WORLD);

	}/* end you are a slave */

   MPI_Finalize();
   exit(0);

}//end MAIN
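
Note that the program above sends back the whole second line of the ifconfig output rather than the address alone. A minimal sketch of the extraction step follows, assuming the line has the "inet addr:a.b.c.d" layout shown in the program output. The helper name extract_ipv4 is hypothetical and is not part of the program above.

```c
#include <string.h>

/*
 * Copy the dotted quad that follows "addr:" in an ifconfig line into out.
 * Returns 0 on success, -1 if no address is found or out is too small.
 * extract_ipv4 is a hypothetical helper, not part of the original program.
 */
int extract_ipv4(const char *line, char *out, size_t outlen)
{
    const char *tag = "addr:";
    const char *start = strstr(line, tag);
    size_t len;

    if (start == NULL)
        return -1;
    start += strlen(tag);

    /* The address is the run of digits and dots right after the tag. */
    len = strspn(start, "0123456789.");
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

In the slave branch this could be called as extract_ipv4(IPline, IPaddress, sizeof(IPaddress)) after the second fgets, with IPaddress then passed to the greeting instead of IPline.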


4) Lastly, I took the above program and inserted a 3-level, time-wasting 
loop (for all ranks > 0) which causes the program to take 24 seconds to 
run. When I run it, the stats for the slaves in the Beo Status Monitor 
never budge. The Master's stats fluctuate.
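
The loop itself is not shown in this message. A rough sketch of what such a 3-level loop might look like is below. The function name waste_time and the bound n are assumptions, and the value of n that gives about 24 seconds depends on the hardware.

```c
/*
 * Spin through n*n*n iterations and return how many were executed.
 * The volatile counter keeps the compiler from optimizing the loops away.
 * waste_time and its bound n are assumptions, not the original code.
 */
unsigned long waste_time(unsigned long n)
{
    volatile unsigned long count = 0;
    unsigned long i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                count++;

    return count;
}
```

It would be called in the rank > 0 branch before MPI_Send, so that each non-master rank stays busy long enough for the status monitor to register load.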

In short I think all the processes are running on the Master and none on 
the slaves. Running "top" on the Master shows all 3 processes on the Master.

What is missing? Is there a networking step that I have to perform to 
get all this to work? /etc/hosts shows only one line:

127.0.0.1               localhost.localdomain localhost

There's nothing in the SCYLD instructions that indicates any other setup 
steps that I have to do.

Thanks for any help you can offer.

Gregg



