MPICH, malloc, and my impending assault of one (1) beowulf cluster
mundy erik
erik.mundy at HAMPTONU.EDU
Wed Jul 18 12:34:53 PDT 2001
Hello, my name is Erik, and I am an MPICH abuser.
I am running a simple one-master, two-slave Beowulf test cluster:
RHL 6.1, kernel 2.4.4, MPICH 1.2.1, with an NFS mount from master to slaves,
on old PII 400s. MPICH is giving me some serious headaches - every MPI program I
execute with a malloc in it crashes with the good old "p4 error: interrupt
SIGSEGV: 11" message. I have been experimenting with the test programs that
come with MPICH for simplicity; for example, 'cpi' runs well on all three
computers. It calculates pi, and I rejoice. Mpptest also works without a
problem between
any two of the three computers. But when I try to mpirun "sendrecv" or
"overtake" from examples/test/pt2pt (both of which use a malloc), MPICH
gives it the good old college try and then throws me the errors. Normally I
would just try to do as much as humanly possible to ignore this problem, but
the code that this beowulf was designed for works when I execute it on one
computer, and crashes rather spectacularly with the segmentation violation
error when I try to mpirun it, even on just one computer, leading me to
think that there is some sort of conflict between MPICH and malloc.
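
One way to test that suspicion is to make every allocation report its own failure, so that a bad malloc is not mistaken for an MPICH fault. The sketch below is only an illustration: alloc_buffer and its label argument are hypothetical names, not part of MPICH or of the test programs. It assumes the buffers hold doubles.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of MPICH): allocate `count` doubles and
 * report a failure on stderr instead of returning an unchecked pointer. */
double *alloc_buffer(size_t count, const char *label)
{
    double *buf;

    /* Reject a zero count and any count whose byte size would overflow size_t. */
    if (count == 0 || count > (size_t)-1 / sizeof *buf) {
        fprintf(stderr, "%s: bad element count %lu\n",
                label, (unsigned long)count);
        return NULL;
    }

    buf = malloc(count * sizeof *buf);
    if (buf == NULL) {
        fprintf(stderr, "%s: malloc of %lu doubles failed\n",
                label, (unsigned long)count);
        return NULL;
    }

    /* Write to the whole buffer now, so a problem with the memory itself
     * shows up here rather than later inside a send or receive. */
    memset(buf, 0, count * sizeof *buf);
    return buf;
}
```

In an MPI program, each bare malloc call would be swapped for alloc_buffer, and a NULL return handled before the buffer is handed to any send or receive. If the SIGSEGV still appears with every allocation checked this way, the allocation itself is less likely to be the culprit.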
Granted, these computers aren't exactly state-of-the-art - each has
only 128M ram with ~400M swap. But that should be more than enough to
execute those simple examples. Has anyone had trouble with the Linux
version of malloc in the past in a situation like this? If you shudder when
you hear the words "malloc" and "MPICH" used in the same sentence, please
email me back. This might be a bit difficult to track down, and I'm really
not the best man for the job; all I did was build a beowulf :). I've only
been on this list for the last two months, but it's taught me that if anyone
can help, it's probably you guys. I am EXTREMELY appreciative of any
assistance you can offer.
Thanks,
Erik
erik.mundy at hamptonu.edu
PS - also, I should mention that yes, the code I am trying to run WAS
designed for use with MPI, and yes, I did patch MPICH with the bug fixes
from the Argonne page. Sorry to take the obvious 'he's so dumb!' solutions
away... I'm hoping there's one more that maybe I'm just missing :)