MPI (lack of) Performance?

Mon May 10 20:24:53 1999

Greetings.

We've set up a small (4 node) cluster here hoping to test and determine
whether or not the Hamachi cards will be sufficient for some of the
parallel applications that we are currently running.  I haven't had any
serious problems getting the cards up and running (initially with
Donald Becker's .08 driver, most recently with Eric Kasten's .14
version; "kasten" is lower case in the driver ;).  We already had an MPI
layer, so getting our code running wasn't a huge problem.

But I'm pretty disappointed with the results that I'm getting for actual
MPI throughput, and I'd like to ask if anyone else (using MPI) gets
similar results, or if you had to do anything special to "up" the
performance.  My results inside of MPI (using the simple ping-pong
program at the bottom) are only on the order of 23 MB/s.

In other words, benchmarks looked decent, but when we threw our
application at it, it... well, it sucked.  ;)  I'm trying to find out
why...

Here's some data:
size     count  time(s)  MB/s(*)
2048     5000   3.59     5.44
4096     5000   3.62     10.78
8192     5000   5.14     15.19
16384    5000   8.88     17.60
32768    5000   15.84    19.73
65536    5000   30.65    20.39
131072   5000   59.40    21.05
262144   5000   111.79   22.36
524288   5000   217.13   23.03
1048576  5000   425.54   23.50
2097152  5000   848.49   23.57

(*) MB/s = ((2*size*count)/time)/(1024*1024)
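
For anyone checking the MB/s column, here is a minimal sketch of the (*)
formula as a C helper.  The function name is illustrative and is not part
of the test program; it assumes size is in bytes and time is in seconds.
Doubles are used because 2*size*count overflows a 32-bit int for the
largest messages.

```c
/* Illustrative helper (not in the original program): payload bandwidth in
   MB/s, with 1 MB = 1024*1024 bytes.  The factor of 2 counts both
   directions of each ping-pong round trip. */
static double pingpong_mb_per_sec( double size, double count, double seconds )
{
  return ( (2.0 * size * count) / seconds ) / (1024.0 * 1024.0);
}
```

Calling pingpong_mb_per_sec( 2048, 5000, 3.59 ) gives roughly 5.44, which
matches the first row of the table.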

Configuration:
	HP Kayak XA-s
	Pentium II 450
	384 MB RAM
	Hamachi cards (32 bit PCI bus)
	kernel 2.2.5
	driver v.14

The switch is an HP8000M w/ HP gigabit ethernet modules.  The gigabit
cards are on a private 10.* network; traffic outside the 10.* network
goes over the second 10/100 card.

Some (useful??) observations:
(1) lights on the switch are "constant" ... ie, when this is running,
they don't flicker, they're "on."  
(2) load on the CPU is minimal.  Running uptime after a couple of
minutes of running will yield somewhere between .01 and .15 1 minute
load.
(3) packets aren't being lost (or at least reported lost) in the switch
diagnostics, beyond on the order of a couple of lost packets every few
million.
(4) no error messages are being reported to /var/log/messages.
(5) 'top i' doesn't show this pong application as a non-idle process
(Huh?)
(6) the "other" ethernet card (eth0) *does* have problems.  I haven't
replaced it yet, but the driver (pcnet32) seems really flakey with this
particular card (an HP special, combo ethernet/SCSI card.  Yippee.). 
Does this matter???  Essentially I get occasional error messages like:
"kernel: eth0: Tx FIFO error! Status 02a3." ... but they aren't very
frequent, unless I try running the test over the 100Mbit lines.  Things
get ugly then... but I've tried this with the line disconnected (Hmm. 
Although I haven't tried rmmod'ing the pcnet32 driver ... didn't think
it would matter).

Anyway, what I'd like to solicit is any advice other people might have
on getting MPI programs to perform underneath the Hamachi cards (or just
in general, does MPI perform poorly on Linux??  I wouldn't think so,
with the whole beowulf concept in high visibility, but this is my first
experience with MPI on Linux).  Any command line parameters (I do have
to use the -nolocal option to avoid default routing over the 10/100) or
environment variables that increase performance?  Compiler directives? 
Black magic incantations?

I'd like to be able to present this as a viable alternative to another
big parallel box, but with these numbers, there's just no competition
(for comparison purposes, this exact same code on an old Convex SPP1600
yields on the order of 90 MB/s ... but that's shared memory).

Code follows.

Thanks!  I'd appreciate any tips!
Jon

PS: These are empty buffers being passed here... a similar version was
used with valid data in the buffer and checking on the other end to make
sure the proper string was received, with no problem.  Data is getting
passed/received properly.
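
A minimal sketch of that kind of payload check, assuming a simple
position-dependent byte pattern.  The helper names fill_pattern and
check_pattern are hypothetical, and this is not the actual verified
version mentioned above.

```c
#include <stddef.h>

/* Hypothetical helpers: not from the original test program. */

/* Fill buf[0..n-1] with a position-dependent pattern derived from seed. */
static void fill_pattern( char *buf, size_t n, unsigned seed )
{
  size_t i;
  for( i=0; i<n; i++ )
    buf[i] = (char)((i + seed) & 0xff);
}

/* Return the index of the first byte that differs from the expected
   pattern, or -1 if all n bytes match. */
static long check_pattern( const char *buf, size_t n, unsigned seed )
{
  size_t i;
  for( i=0; i<n; i++ )
    if( buf[i] != (char)((i + seed) & 0xff) )
      return (long)i;
  return -1;
}
```

The sender would call fill_pattern( sendbuf, size, i ) before MPI_Send and
the receiver check_pattern( recvbuf, size, i ) after MPI_Recv.  The check
adds CPU work, so it belongs in a separate correctness run rather than
inside the timed loop.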

#include <mpi.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>   /* atoi, exit */

#define BUFFER_SIZE 5000000
#define MSG_TAG     1

char sendbuf[BUFFER_SIZE];
char recvbuf[BUFFER_SIZE];

int main( int argc, char **argv )
{
  int i,j;
  int rank,nproc;
  char myname[MPI_MAX_PROCESSOR_NAME];
  int  namelen;
  int  result;
  int  size, repcount;
  double t1,t2;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &nproc );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  MPI_Get_processor_name( myname, &namelen );

  if( argc > 1 )
    size = atoi(argv[1]);
  else
    size = 4096;

  if( argc > 2 )
    repcount = atoi( argv[2] );
  else
    repcount = 1000;

  MPI_Errhandler_set( MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL );

  for( j=0; j<10; j++ )
  {
    MPI_Barrier( MPI_COMM_WORLD );

    t1=MPI_Wtime();

    switch( rank )
    {
      case 0:
        for( i=0; i<repcount; i++ )
        {
          MPI_Send( sendbuf, size, MPI_BYTE, 1, MSG_TAG, MPI_COMM_WORLD );
          MPI_Recv( recvbuf, size, MPI_BYTE, 1, MSG_TAG, MPI_COMM_WORLD,
                    &status );
        }
        break;

      case 1:
        for( i=0; i<repcount; i++ )
        {
          MPI_Recv( recvbuf, size, MPI_BYTE, 0, MSG_TAG, MPI_COMM_WORLD,
                    &status );
          MPI_Send( sendbuf, size, MPI_BYTE, 0, MSG_TAG, MPI_COMM_WORLD );
        }
        break;

      default:
        break;
    }

    t2=MPI_Wtime();

    if( rank == 0 )
    {
      if( j==0 )
        printf( "%d  %d  %.3f", size, repcount, t2-t1 );
      else
        printf( "  %.3f",t2-t1 );
    }
  }

  if( rank == 0 ) printf( "\n" );

  MPI_Finalize();
  exit(0);
}