LAM and channel bonding doesn't work?
Martin Siegert
siegert at sfu.ca
Thu Jul 12 19:05:23 PDT 2001
Hi,
sorry to bug you with this, but I have been banging my head against the wall
for days now and cannot find the source of my problem. Thus, I need help...
There seems to be a bug in LAM (I tested versions 6.3.2 and 6.5.2,
which I compiled myself with --with-rpi=usysv, as well as the RPMs
lam-6.5.1-1.i386.rpm (RH 7.1), lam-6.5.2-usysv.1.i386.rpm, and
lam-6.5.3-usysv.1.i386.rpm) that shows up only when channel bonding
is used. Furthermore, it appears only with a user-defined datatype
created with MPI_Type_vector, only when the system size (and consequently
the message size) is sufficiently large, and only when nonblocking
sends/receives (MPI_Isend, MPI_Irecv) are used. I append my test program
below (again, sorry for this; it is as condensed as I could make it).
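In essence, the case that hangs is each process posting a nonblocking
receive into a contiguous buffer and a nonblocking send of one strided
block (a condensed sketch of what the appended program does with -2;
the variable names are the ones used there):

  MPI_Type_vector(l_proc, l_proc, l, MPI_DOUBLE, &block); /* l_proc rows of l_proc doubles, stride l */
  MPI_Type_commit(&block);
  MPI_Irecv(&work[idx_recv], l_proc*l_proc, MPI_DOUBLE, id_recv, id_recv,
            MPI_COMM_WORLD, &r_req);
  MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
            MPI_COMM_WORLD, &s_req);
  MPI_Wait(&s_req, &s_status);
  MPI_Wait(&r_req, &r_status);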
The problem does not show up with mpich-1.2.1 or mpipro-1.6.3, only with
LAM, which is unfortunate because I rely on LAM's low latency for
performance.
If by now you have hit the 'd' key because of all this bizarreness,
I can't blame you - nevertheless, here is the problem:
I compile the program:
# mpicc -O2 -o vector-test vector-test.c
and run it on two processors that must not be on the same machine
(the program works fine using two processors on an SMP box):
# lamboot -v
LAM 6.5.2/MPI 2 C++/ROMIO - University of Notre Dame
Executing hboot on n0 (b001 - 1 CPU)...
Executing hboot on n1 (b002 - 1 CPU)...
topology done
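(For completeness: lamboot reads a boot schema that just lists the hosts;
for the two nodes above it is essentially

  b001 cpu=1
  b002 cpu=1

though the exact file should not matter here.)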
# mpirun -np 2 -O vector-test -2 4000
id=1: MPI_Isend done.
id=1: MPI_Irecv done.
id=0: MPI_Isend done.
id=0: MPI_Irecv done.
id=0: elapsed time: 1.982758 s
id=1: elapsed time: 1.856275 s
# mpirun -np 2 -O vector-test -2 8000
id=0: MPI_Isend done.
and at that point the program hangs forever :-(
Strangely enough, if only one processor does an MPI_Isend and the other an
MPI_Irecv, the program runs through:
# mpirun -np 2 -O vector-test 8000
id=0: MPI_Irecv done.
id=0: elapsed time: 4.800264 s
id=1: MPI_Isend done.
id=1: elapsed time: 4.737434 s
Also with blocking send/recv it runs:
# mpirun -np 2 -O vector-test -b -2 8000
id=0: MPI_Recv done.
id=1: MPI_Send done.
id=0: MPI_Send done.
id=0: elapsed time: 16.022285 s
id=1: MPI_Recv done.
id=1: elapsed time: 15.358338 s
Under mpich there is no problem either:
# mpirun -np 2 vector-test 8000
id=1: MPI_Isend done.
id=0: MPI_Isend done.
id=0: MPI_Irecv done.
id=0: elapsed time: 10.939983 s
id=1: MPI_Irecv done.
id=1: elapsed time: 11.625667 s
Does anybody have an idea what the problem could be? How can I debug this?
If there is somebody on the list who uses LAM and channel bonding: would
you be willing to verify or disprove this? You will probably have to change
the system size; I strongly suspect that the critical size is memory dependent:
my numbers (4000 and 8000) are for 512 MB of RAM.
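For orientation (two processes; the numbers follow from the program below):
each process allocates d_matrix and work of (L/2)*L doubles each, and the
block that is actually transferred is (L/2)*(L/2) doubles, i.e.

  allocated per process: 2*(L/2)*L*8 bytes   ~ 512 MB for L=8000, 128 MB for L=4000
  transferred block:     (L/2)*(L/2)*8 bytes ~ 128 MB for L=8000,  32 MB for L=4000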
Sorry again for dumping this on the list.
Regards,
Martin
========================================================================
Martin Siegert
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6
========================================================================
---<cut here: vector-test.c>--------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <unistd.h>   /* getopt, optarg, optind */
#include <mpi.h>
/* allocate a 2d array of doubles with subscript range
   dm[min_x,...,max_x][min_y,...,max_y] contiguously in memory */
double **alloc_darray2d(int min_x, int max_x, int min_y, int max_y,
                        int *ialloc_err)
{
  int i, nx = max_x - min_x + 1, ny = max_y - min_y + 1;
  double **dm;

  *ialloc_err = 0;
  dm = (double **) malloc((size_t) nx*sizeof(double *));
  if (dm == NULL) {
    *ialloc_err = 1;
    return dm;
  }
  dm -= min_x;
  dm[min_x] = (double *) malloc((size_t) nx*ny*sizeof(double));
  if (dm[min_x] == NULL) {
    *ialloc_err = 2;
    return dm;
  }
  dm[min_x] -= min_y;
  for (i = min_x+1; i <= max_x; i++) dm[i] = dm[i-1] + ny;
  /* return pointer to array of pointers to rows */
  return dm;
}
void free_darray2d(double **dm, int min_x, int max_x, int min_y, int max_y)
{
  free((void *) (dm[min_x] + min_y));
  free((void *) (dm + min_x));
}
int main(int argc, char *argv[]){
  /* This program tests message passing with a user defined data type:
     each processor allocates a matrix of size L x (L/numprocs), where
     numprocs is the # of processors that are used in the computation
     (from MPI_Comm_size). The matrix is split into numprocs blocks of
     size (L/numprocs) x (L/numprocs). Hence, the data in each block
     are (L/numprocs) arrays of size (L/numprocs). These arrays are stored
     a stride L apart. These blocks are defined as a new MPI datatype using
     MPI_Type_vector. Each processor sends one block to the next processor
     and receives one block from the previous processor. */
  double **d_matrix;
  double *work, *local_data;
  double start_time, elapsed_time;
  int l, lm1, lx, ly, i, j, k, ialloc_err;
  int myid, numprocs, l_proc, l2_proc, n_local, id_send, id_recv, idx_send, idx_recv;
  int blocking = 0, send_and_receive = 0;
  int c, errflag = 0;
  extern char *optarg;
  extern int optind;
  MPI_Datatype block;
  MPI_Request s_req;
  MPI_Request r_req;
  MPI_Status s_status;
  MPI_Status r_status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  while ((c = getopt(argc, argv, "b2")) != -1) {
    switch (c) {
      case 'b': blocking = 1;
                break;
      case '2': send_and_receive = 1;
                break;
      case '?': errflag++;
    }
  }
  if (argc != optind+1) errflag++;
  else if (sscanf(argv[optind++], "%i", &l) != 1) errflag++; /* read L only if it was given */
  if (errflag) {
    if (myid == 0) {
      fprintf(stderr, "usage: %s [-b] [-2] L\n", argv[0]);
      fprintf(stderr, "with LxL/(# of procs) being the size of the matrix"
                      " (int).\n"
                      "If -b is specified blocking send/recv are used.\n"
                      "If -2 is specified, each process sends and receives."
                      "\n");
    }
    MPI_Finalize();
    exit(1);
  }
  l = ((int)(((double) l)/numprocs + 0.5))*numprocs;
  lm1 = l - 1;
  l_proc = l/numprocs;
  ly = l_proc;
  l2_proc = l_proc*l_proc;
  n_local = l_proc*l;

  /* allocate matrix, work array */
  d_matrix = alloc_darray2d(0, l_proc-1, 0, lm1, &ialloc_err);
  if (ialloc_err) {
    fprintf(stderr, "id=%i: matrix allocation error %i\n", myid, ialloc_err);
    MPI_Abort(MPI_COMM_WORLD, ialloc_err);
    exit(ialloc_err);
  }
  work = (double *) malloc(n_local*sizeof(double));
  if (work == NULL) {
    fprintf(stderr, "id=%i: work allocation error\n", myid);
    MPI_Abort(MPI_COMM_WORLD, 1);
    exit(1);
  }

  /* define datatype of a block to be sent: a total of l_proc*l_proc
     elements, stored in l_proc arrays of size l_proc that are a
     stride l apart */
  MPI_Type_vector(l_proc, l_proc, l, MPI_DOUBLE, &block);
  MPI_Type_commit(&block);
  /* initialize matrix */
  for (i = 0; i < l_proc; i++) {
    for (j = 0; j < l; j++) {
      d_matrix[i][j] = (double) myid;
    }
  }

  /* send block to id+1, recv block from id-1;
     the send block starts at column myid*l_proc;
     the data are received as a contiguous block of l_proc*l_proc doubles */
  id_send = (myid + 1) % numprocs;
  id_recv = (myid - 1 + numprocs) % numprocs;
  idx_send = myid*l_proc;
  idx_recv = id_recv*l2_proc;

  start_time = MPI_Wtime();
  if (blocking) {
    if (myid % 2) {
      MPI_Send(&d_matrix[0][idx_send], 1, block, id_send, myid,
               MPI_COMM_WORLD);
      fprintf(stderr, "id=%i: MPI_Send done.\n", myid);
      if (send_and_receive) {
        MPI_Recv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                 MPI_COMM_WORLD, &r_status);
        fprintf(stderr, "id=%i: MPI_Recv done.\n", myid);
      }
    } else {
      MPI_Recv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
               MPI_COMM_WORLD, &r_status);
      fprintf(stderr, "id=%i: MPI_Recv done.\n", myid);
      if (send_and_receive) {
        MPI_Send(&d_matrix[0][idx_send], 1, block, id_send, myid,
                 MPI_COMM_WORLD);
        fprintf(stderr, "id=%i: MPI_Send done.\n", myid);
      }
    }
  } else {
    if (send_and_receive) {
      MPI_Irecv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                MPI_COMM_WORLD, &r_req);
      MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
                MPI_COMM_WORLD, &s_req);
      MPI_Wait(&s_req, &s_status);
      fprintf(stderr, "id=%i: MPI_Isend done.\n", myid);
      MPI_Wait(&r_req, &r_status);
      fprintf(stderr, "id=%i: MPI_Irecv done.\n", myid);
    } else {
      if (myid % 2) {
        MPI_Isend(&d_matrix[0][idx_send], 1, block, id_send, myid,
                  MPI_COMM_WORLD, &s_req);
        MPI_Wait(&s_req, &s_status);
        fprintf(stderr, "id=%i: MPI_Isend done.\n", myid);
      } else {
        MPI_Irecv(&work[idx_recv], l2_proc, MPI_DOUBLE, id_recv, id_recv,
                  MPI_COMM_WORLD, &r_req);
        MPI_Wait(&r_req, &r_status);
        fprintf(stderr, "id=%i: MPI_Irecv done.\n", myid);
      }
    }
  }
  elapsed_time = MPI_Wtime() - start_time;
  printf("id=%i: elapsed time: %f s\n", myid, elapsed_time);

  MPI_Finalize();
  return 0;
}