mpi-prog porting from lam -> scyld beowulf mpi difficulties

Peter Beerli beerli at
Wed Nov 28 17:03:46 PST 2001

I have a program developed using MPI-1 under LAM.
It runs fine on several LAM-MPI clusters with different architectures.
A user wants to run it on a Scyld Beowulf cluster, and there it fails.
I did a few tests myself, and it seems that the program stalls when run
on more than 3 nodes, but seems to work for 2-3 nodes. The program has a
master-slave architecture where the master is mostly idle. There are some
reports sent to stdout from every node
(but this seems to work the same way in beompi as in LAM).
There are several things unclear to me,
because I have no clue about the beompi system, Beowulf, and Scyld in
general:

(1) If I run "top", why do I see 6 processes running when I start
    with mpirun -np 3 migrate-n ?

(2) The data phase stalls on the slave nodes.
    The master node reads the data from a file and then broadcasts
    a large char buffer to the slaves. Is this wrong? Is there a better way
    to do that? [I do not know in advance how big the data is, and it is a
    complex mix of strings, numbers, etc.]

void
broadcast_data_master (data_fmt * data, option_fmt * options)
{
  long bufsize;
  char *buffer;
  /* pack_databuffer() grows the buffer as needed and returns its size */
  buffer = (char *) calloc (1, sizeof (char));
  bufsize = pack_databuffer (&buffer, data, options);
  MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
  MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
  free (buffer);
}

void
broadcast_data_worker (data_fmt * data, option_fmt * options)
{
  long bufsize;
  char *buffer;
  /* first learn the size, then allocate and receive the payload */
  MPI_Bcast (&bufsize, 1, MPI_LONG, MASTER, comm_world);
  buffer = (char *) calloc (bufsize, sizeof (char));
  MPI_Bcast (buffer, bufsize, MPI_CHAR, MASTER, comm_world);
  unpack_databuffer (buffer, data, options);
  free (buffer);
}

    The master and the first slave seem to read the data fine,
    but the others either wait forever or die silently.
(3) What is the easiest way to debug this? With LAM I just attached gdb to
    the PIDs on the different nodes, but here the nodes are transparent to me
    [but, as I said, I have never used a Beowulf cluster before].

Can you give me some pointers or hints?

Peter Beerli,  Genome Sciences, Box #357730, University of Washington,
Seattle WA 98195-7730 USA, Ph:2065438751, Fax:2065430754

More information about the Beowulf mailing list