[Beowulf] Parallel Programming Question

Mon Jul 16 12:35:39 PDT 2007

Hello,

I'm new to parallel programming and MPI.  I've developed a simulator in C++
for which I would
like to decrease the running time by using a Beowulf cluster.  I'm not
interested in optimizing
speed, I'm just looking for a quick and easy way to significantly improve
the speed over running
the program on a single machine.
Basically, I need to learn how to parallelize a C++ function.  The following
functions in particular
take the longest to run in my simulator.  The first implements LU
decomposition on a large matrix and
the second implements the backsubstitution method to solve matrix division.

void NR::ludcmp(Mat_IO_DP &a, Vec_O_INT &indx, DP &d)
{
    const DP TINY=1.0e-20;
    int i,imax,j,k;
    DP big,dum,sum,temp;

    int n=a.nrows();
    Vec_DP vv(n);
    d=1.0;
    for (i=0;i<n;i++) {
        big=0.0;
        for (j=0;j<n;j++)
            if ((temp=fabs(a[i][j])) > big) big=temp;
        if (big == 0.0) nrerror("Singular matrix in routine ludcmp");
        vv[i]=1.0/big;
    }
    for (j=0;j<n;j++) {
        for (i=0;i<j;i++) {
            sum=a[i][j];
            for (k=0;k<i;k++) sum -= a[i][k]*a[k][j];
            a[i][j]=sum;
        }
        big=0.0;
        for (i=j;i<n;i++) {
            sum=a[i][j];
            for (k=0;k<j;k++) sum -= a[i][k]*a[k][j];
            a[i][j]=sum;
            if ((dum=vv[i]*fabs(sum)) >= big) {
                big=dum;
                imax=i;
            }
        }
        if (j != imax) {
            for (k=0;k<n;k++) {
                dum=a[imax][k];
                a[imax][k]=a[j][k];
                a[j][k]=dum;
            }
            d = -d;
            vv[imax]=vv[j];
        }
        indx[j]=imax;
        if (a[j][j] == 0.0) a[j][j]=TINY;
        if (j != n-1) {
            dum=1.0/(a[j][j]);
            for (i=j+1;i<n;i++) a[i][j] *= dum;
        }
    }
}

and...

void NR::lubksb(Mat_I_DP &a, Vec_I_INT &indx, Vec_IO_DP &b)
{
    int i,ii=0,ip,j;
    DP sum;

    int n=a.nrows();
    for (i=0;i<n;i++) {
        ip=indx[i];
        sum=b[ip];
        b[ip]=b[i];
        if (ii != 0)
            for (j=ii-1;j<i;j++) sum -= a[i][j]*b[j];
        else if (sum != 0.0)
            ii=i+1;
        b[i]=sum;
    }
    for (i=n-1;i>=0;i--) {
        sum=b[i];
        for (j=i+1;j<n;j++) sum -= a[i][j]*b[j];
        b[i]=sum/a[i][i];
    }
}

(The functions are borrowed from the library provided by "Numerical Recipes
in C++")
I'm currently calling these functions from the main loop with the lines:

NR::ludcmp(c,indx,d);

and

NR::lubksb(c,indx,xv);

where the variable 'c' is a large matrix (holding image pixel values) and
'xv' is a vector
used in backsubstitution.
All of the variables passed into these functions are of types defined in "
nr.h"
'c' is a Mat_DP  (double-precision matrix)
'indx' is a Vec_INT  (integer vector)
'd' is a DP    (double precision)
'xv' is a Vec_DP  (double precision vector)

Is there a simple way to call these functions which will cause the cluster
to distribute the
load of operations?  Currently, when I run the program with a 50x50 array on
a single machine,
it takes about 5 minutes to process a single iteration through the matrix
division.

Any help would be greatly appreciated.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20070716/49875fff/attachment.html>