AGAIN: mpi-prog from lam -> scyld beompi DIES

Greg Lindahl lindahl at conservativecomputer.com
Mon Dec 10 13:01:27 PST 2001


On Sat, Dec 08, 2001 at 12:36:25PM -0800, Peter Beerli wrote:

> Some time ago I asked about a problem with my MPI program on a Scyld
> Beowulf cluster and got no real response to it.
> - did nobody ever port a LAM-MPI program to a Scyld Beowulf cluster?
> - did I miss the right keywords, or what information is missing?

Well, here are a couple of clues.

1) If you really want to run gdb against the processes, and you can
convince your program to use little enough memory that all of the
processes fit on one node, you can do:

export ALL_LOCAL=1
mpirun foo bar baz

and all the processes will run on the master. Attach gdb, enjoy.
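
For example, assuming the binary is named foo (the PID below is made
up; use whatever ps reports on your system):

ps aux | grep foo     # find the PID of the rank you care about
gdb ./foo 1234        # attach gdb to that process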

bproc isn't a full enough emulation of /proc to run gdb remotely. If
you REALLY need to do that, you can bpsh gdb to a remote node after
using bpsh ps to find the remote PID, etc. etc. If you figure this
out, do write a script for it so everyone else doesn't have to deal
with such nasty details.
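
Something like this might be a starting point -- an untested sketch,
assuming the node-side ps understands the usual procps options and
that bpsh passes the terminal through well enough for interactive gdb:

#!/bin/sh
# rgdb: attach gdb to a program running on a Scyld slave node.
# Usage: rgdb <node> <program>   -- a sketch, adjust to taste.
NODE=$1
PROG=$2
# ask ps on the node itself for the PID; the master's /proc
# doesn't show remote processes
PID=`bpsh $NODE ps -C $PROG -o pid= | head -1`
# run gdb on the node too, against the same binary and that PID
exec bpsh $NODE gdb $PROG $PID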

2) You can always add printfs to the program. In your case I would
suggest printing out the sent and received values of buffsize, and
then adding another printf after the actual data arrives. My guess is
that you're somehow stomping on memory in a way that differs between
LAM and MPICH, and so the free() causes a core dump.
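
For instance, the instrumentation might look like this. This is a
minimal sketch only: "buffsize", the tags, and the message structure
are guesses at what the real program does, not its actual code.

/* build with: mpicc -g buffcheck.c -o buffcheck; run with -np 2 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, buffsize;
    char *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buffsize = 1024;
        buf = malloc(buffsize);
        memset(buf, 0, buffsize);
        printf("rank 0: sending buffsize=%d\n", buffsize);
        fflush(stdout);   /* flush so the line survives a later crash */
        MPI_Send(&buffsize, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(buf, buffsize, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&buffsize, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1: received buffsize=%d\n", buffsize);
        fflush(stdout);
        buf = malloc(buffsize);
        MPI_Recv(buf, buffsize, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
        printf("rank 1: data arrived, about to free()\n");
        fflush(stdout);
        free(buf);        /* dies here if something overran buf */
    }

    MPI_Finalize();
    return 0;
}

If the two buffsize lines disagree, the size is getting mangled in
transit; if they agree but the process dumps core right after the
"about to free()" line, the buffer was overrun before the free().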

greg



