LAM and channel bonding doesn't work?
Martin Siegert
siegert at sfu.ca
Thu Jul 12 19:05:23 PDT 2001
Hi,
sorry to bug you with this, but I have been banging my head against the wall
for days now and cannot find the source of my problem. Thus, I need help...
There seems to be a bug in LAM (I tested versions 6.3.2 and 6.5.2,
which I compiled myself with --with-rpi=usysv, as well as the RPMs
lam-6.5.1-1.i386.rpm (RH 7.1), lam-6.5.2-usysv.1.i386.rpm, and
lam-6.5.3-usysv.1.i386.rpm) that shows up only when channel bonding
is used. Furthermore, it appears only with a user-defined datatype
created with MPI_Type_vector, only when the system size (and consequently
the message size) is sufficiently large, and only when nonblocking
sends/receives (MPI_Isend, MPI_Irecv) are used. I append my test program
below (again, sorry for this; it is as condensed as I could make it).
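In essence, the case that hangs is each process posting a nonblocking
receive into a contiguous buffer and a nonblocking send of one strided
block (a condensed sketch of what the appended program does with -2;
the variable names are the ones used there):

  MPI_Type_vector(l_proc, l_proc, l, MPI_DOUBLE, &block); /* l_proc rows of l_proc doubles, stride l */
  MPI_Type_commit(&block);
  MPI_Irecv(&work[idx_recv], l_proc*l_proc, MPI_DOUBLE, id_recv, id_recv,
            MPI_COMM_WORLD, &r_req);
  MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
            MPI_COMM_WORLD, &s_req);
  MPI_Wait(&s_req, &s_status);
  MPI_Wait(&r_req, &r_status);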
The problem does not show up with mpich-1.2.1 or mpipro-1.6.3, only with
LAM, which is unfortunate because I rely on LAM's low latency for
performance.
If by now you have hit the 'd' key because of all this bizarreness,
I can't blame you - nevertheless, here is the problem:
I compile the program:
# mpicc -O2 -o vector-test vector-test.c
and run it on two processors that must not be on the same machine
(the program works fine using two processors on an SMP box):
# lamboot -v
LAM 6.5.2/MPI 2 C++/ROMIO - University of Notre Dame
Executing hboot on n0 (b001 - 1 CPU)...
Executing hboot on n1 (b002 - 1 CPU)...
topology done
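(For completeness: lamboot reads a boot schema that just lists the hosts;
for the two nodes above it is essentially

  b001 cpu=1
  b002 cpu=1

though the exact file should not matter here.)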
# mpirun -np 2 -O vector-test -2 4000
id=1: MPI_Isend done.
id=1: MPI_Irecv done.
id=0: MPI_Isend done.
id=0: MPI_Irecv done.
id=0: elapsed time: 1.982758 s
id=1: elapsed time: 1.856275 s
# mpirun -np 2 -O vector-test -2 8000
id=0: MPI_Isend done.
and at that point the program hangs forever :-(
Strangely enough, if only one processor does an MPI_Isend and the other an
MPI_Irecv, the program runs through:
# mpirun -np 2 -O vector-test 8000
id=0: MPI_Irecv done.
id=0: elapsed time: 4.800264 s
id=1: MPI_Isend done.
id=1: elapsed time: 4.737434 s
Also with blocking send/recv it runs:
# mpirun -np 2 -O vector-test -b -2 8000
id=0: MPI_Recv done.
id=1: MPI_Send done.
id=0: MPI_Send done.
id=0: elapsed time: 16.022285 s
id=1: MPI_Recv done.
id=1: elapsed time: 15.358338 s
Under mpich there is no problem either:
# mpirun -np 2 vector-test 8000
id=1: MPI_Isend done.
id=0: MPI_Isend done.
id=0: MPI_Irecv done.
id=0: elapsed time: 10.939983 s
id=1: MPI_Irecv done.
id=1: elapsed time: 11.625667 s
Does anybody have an idea what the problem could be? How can I debug this?
If there is somebody on the list who uses LAM and channel bonding: would
you be willing to verify or disprove this? You will probably have to change
the system size; I strongly suspect that the critical size is memory dependent:
my numbers (4000 and 8000) are for 512 MB of RAM.
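For orientation (two processes; the numbers follow from the program below):
each process allocates d_matrix and work of (L/2)*L doubles each, and the
block that is actually transferred is (L/2)*(L/2) doubles, i.e.

  allocated per process: 2*(L/2)*L*8 bytes   ~ 512 MB for L=8000, 128 MB for L=4000
  transferred block:     (L/2)*(L/2)*8 bytes ~ 128 MB for L=8000,  32 MB for L=4000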
Sorry again for dumping this on the list.
Regards,
Martin
========================================================================
Martin Siegert
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6
========================================================================
---<cut here: vector-test.c>--------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <unistd.h>   /* getopt, optarg, optind */
#include <mpi.h>
/* allocate a 2d array of doubles with subscript range
   dm[min_x,...,max_x][min_y,...,max_y] contiguously in memory */
double **alloc_darray2d(int min_x, int max_x, int min_y, int max_y,
                        int *ialloc_err)
{
  int i, nx = max_x - min_x + 1, ny = max_y - min_y + 1;
  double **dm;

  *ialloc_err = 0;
  dm = (double **) malloc((size_t) nx*sizeof(double *));
  if (dm == NULL) {
    *ialloc_err = 1;
    return dm;
  }
  dm -= min_x;
  dm[min_x] = (double *) malloc((size_t) nx*ny*sizeof(double));
  if (dm[min_x] == NULL) {
    *ialloc_err = 2;
    return dm;
  }
  dm[min_x] -= min_y;
  for (i = min_x+1; i <= max_x; i++) dm[i] = dm[i-1] + ny;
  /* return pointer to array of pointers to rows */
  return dm;
}
void free_darray2d(double **dm, int min_x, int max_x, int min_y, int max_y)
{
  free((void *) (dm[min_x] + min_y));
  free((void *) (dm + min_x));
}
int main(int argc, char *argv[]){
  /* This program tests message passing with a user defined data type:
     each processor allocates a matrix of size L x (L/numprocs), where
     numprocs is the # of processors that are used in the computation
     (from MPI_Comm_size). The matrix is split into numprocs blocks of
     size (L/numprocs) x (L/numprocs). Hence, the data in each block
     are (L/numprocs) arrays of size (L/numprocs). These arrays are stored
     a stride L apart. These blocks are defined as a new MPI datatype using
     MPI_Type_vector. Each processor sends one block to the next processor
     and receives one block from the previous processor. */
  double **d_matrix;
  double *work, *local_data;
  double start_time, elapsed_time;
  int l, lm1, lx, ly, i, j, k, ialloc_err;
  int myid, numprocs, l_proc, l2_proc, n_local, id_send, id_recv, idx_send, idx_recv;
  int blocking = 0, send_and_receive = 0;
  int c, errflag = 0;
  extern char *optarg;
  extern int optind;
  MPI_Datatype block;
  MPI_Request s_req;
  MPI_Request r_req;
  MPI_Status s_status;
  MPI_Status r_status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  while ((c = getopt(argc, argv, "b2")) != -1) {
    switch (c) {
      case 'b': blocking = 1;
                break;
      case '2': send_and_receive = 1;
                break;
      case '?': errflag++;
    }
  }
  if (argc != optind+1) errflag++;
  else if (sscanf(argv[optind++], "%i", &l) != 1) errflag++; /* read L only if it was given */
  if (errflag) {
    if (myid == 0) {
      fprintf(stderr, "usage: %s [-b] [-2] L\n", argv[0]);
      fprintf(stderr, "with LxL/(# of procs) being the size of the matrix"
                      " (int).\n"
                      "If -b is specified blocking send/recv are used.\n"
                      "If -2 is specified, each process sends and receives."
                      "\n");
    }
    MPI_Finalize();
    exit(1);
  }
  l = ((int)(((double) l)/numprocs + 0.5))*numprocs;
  lm1 = l - 1;
  l_proc = l/numprocs;
  ly = l_proc;
  l2_proc = l_proc*l_proc;
  n_local = l_proc*l;

  /* allocate matrix, work array */
  d_matrix = alloc_darray2d(0, l_proc-1, 0, lm1, &ialloc_err);
  if (ialloc_err) {
    fprintf(stderr, "id=%i: matrix allocation error %i\n", myid, ialloc_err);
    MPI_Abort(MPI_COMM_WORLD, ialloc_err);
    exit(ialloc_err);
  }
  work = (double *) malloc(n_local*sizeof(double));
  if (work == NULL) {
    fprintf(stderr, "id=%i: work allocation error\n", myid);
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
  }

  /* define datatype of a block to be sent: a total of l_proc*l_proc
     elements, stored in l_proc arrays of size l_proc that are a
     stride l apart */
  MPI_Type_vector(l_proc, l_proc, l, MPI_DOUBLE, &block);
  MPI_Type_commit(&block);
  /* initialize matrix */
  for (i = 0; i < l_proc; i++) {
    for (j = 0; j < l; j++) {
      d_matrix[i][j] = (double) myid;
    }
  }

  /* send block to id+1, recv block from id-1;
     the send block starts at column myid*l_proc;
     the data are received as a contiguous block of l_proc*l_proc doubles */
  id_send = (myid + 1) % numprocs;
  id_recv = (myid - 1 + numprocs) % numprocs;
  idx_send = myid*l_proc;
  idx_recv = id_recv*l2_proc;

  start_time = MPI_Wtime();
  if (blocking) {
    if (myid % 2) {
      MPI_Send(&d_matrix[0][idx_send], 1, block, id_send, myid,
               MPI_COMM_WORLD);
      fprintf(stderr, "id=%i: MPI_Send done.\n", myid);
      if (send_and_receive) {
        MPI_Recv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                 MPI_COMM_WORLD, &r_status);
        fprintf(stderr, "id=%i: MPI_Recv done.\n", myid);
      }
    } else {
      MPI_Recv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
               MPI_COMM_WORLD, &r_status);
      fprintf(stderr, "id=%i: MPI_Recv done.\n", myid);
      if (send_and_receive) {
        MPI_Send(&d_matrix[0][idx_send], 1, block, id_send, myid,
                 MPI_COMM_WORLD);
        fprintf(stderr, "id=%i: MPI_Send done.\n", myid);
      }
    }
  } else {
    if (send_and_receive) {
      MPI_Irecv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                MPI_COMM_WORLD, &r_req);
      MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
                MPI_COMM_WORLD, &s_req);
      MPI_Wait(&s_req, &s_status);
      fprintf(stderr, "id=%i: MPI_Isend done.\n", myid);
      MPI_Wait(&r_req, &r_status);
      fprintf(stderr, "id=%i: MPI_Irecv done.\n", myid);
    } else {
      if (myid % 2) {
        MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
                  MPI_COMM_WORLD, &s_req);
        MPI_Wait(&s_req, &s_status);
        fprintf(stderr, "id=%i: MPI_Isend done.\n", myid);
      } else {
        MPI_Irecv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                  MPI_COMM_WORLD, &r_req);
        MPI_Wait(&r_req, &r_status);
        fprintf(stderr, "id=%i: MPI_Irecv done.\n", myid);
      }
    }
  }
  elapsed_time = MPI_Wtime() - start_time;
  printf("id=%i: elapsed time: %f s\n", myid, elapsed_time);

  MPI_Finalize();
  return 0;
}