[Beowulf] An annoying MPI problem
apittman at concurrent-thinking.com
Wed Jul 9 14:25:27 PDT 2008
On Tue, 2008-07-08 at 22:01 -0400, Joe Landman wrote:
> Short version: The code starts and runs. Reads in its data. Starts
> its iterations. And then somewhere after this, it hangs. But not
> always at the same place. It doesn't write state data back out to the
> disk, just logs. Rerunning it gets it to a different point, sometimes
> hanging sooner, sometimes later. Seems to be the case on multiple
> different machines, with different OSes. Working on comparing MPI
> distributions, and it hangs with IB as well as with shared memory and
> tcp sockets.
Sounds like you've found a bug, doesn't sound too difficult to find,
> Right now we are using OpenMPI 1.2.6, and this code does use
> allreduce. When it hangs, an strace of the master process shows lots of
Why do you mention allreduce? Does it tend to be in allreduce when it
hangs? Is it perhaps happening at the same place but on a different
iteration every time? This is quite important: you could have either
"random" memory corruption, which can cause the program to stop anywhere
and is often hard to find, or a race condition, which is easier to deal
with. If there are any similarities in the stack then it tends to point
to the latter.
allreduce is one of the collective functions with an implicit barrier,
which means that *no* process can return from it until *all* processes
have called it. If your program uses allreduce extensively it's entirely
possible that one process has stopped for whatever reason and the rest
have continued as far as they can until they too deadlock. Collectives
often get accused of causing programs to hang when in reality N-1
processes are in the collective call and 1 is off somewhere else.
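As a toy illustration of the N-1 situation, here is a minimal sketch that uses Python threads and a `threading.Barrier` as a stand-in for the collective. It is not MPI and says nothing about how OpenMPI implements allreduce; the function name `run_collective`, the status strings and the timeout are all invented for the example. The timeout only exists so the demonstration terminates, where a real job would spin forever.

```python
import threading


def run_collective(n_ranks, stuck_rank, timeout=0.5):
    """Toy model: every 'rank' must reach one barrier (our fake allreduce).

    stuck_rank is the rank that never makes the call, or None.
    Returns a dict mapping rank -> status string.
    """
    barrier = threading.Barrier(n_ranks)
    status = {}

    def rank_main(rank):
        if rank == stuck_rank:
            # This rank is "off somewhere else" and never joins in.
            status[rank] = "never called the collective"
            return
        try:
            barrier.wait(timeout=timeout)
            status[rank] = "completed"
        except threading.BrokenBarrierError:
            # Raised when the wait times out (or another waiter timed out).
            status[rank] = "stuck in collective"

    threads = [threading.Thread(target=rank_main, args=(r,))
               for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    for rank, state in sorted(run_collective(4, stuck_rank=2).items()):
        print(rank, state)
```

Looking only at the ranks that report "stuck in collective" would make the collective look guilty, while the rank worth investigating is the one that never got there.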
> c1-1:~ # strace -p 8548
> [spin forever]
Any chance of a stack trace, preferably a parallel one? I assume *all*
processes in the job are in the R state? Do you have a mechanism
available to allow you to see the message queues?
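The point of a parallel stack trace is to spot the odd process out quickly. A minimal sketch of that grouping step, assuming you have already collected the innermost MPI-level frame for each rank by whatever means you have available; the helper name `find_outliers` and the sample frames are hypothetical, not output from the job discussed here.

```python
from collections import defaultdict


def find_outliers(frames_by_rank):
    """Group ranks by the frame they are sitting in.

    Returns (majority_frame, outliers), where outliers maps each
    non-majority frame to the sorted list of ranks found there.
    """
    groups = defaultdict(list)
    for rank, frame in sorted(frames_by_rank.items()):
        groups[frame].append(rank)
    majority = max(groups, key=lambda frame: len(groups[frame]))
    outliers = {f: ranks for f, ranks in groups.items() if f != majority}
    return majority, outliers


if __name__ == "__main__":
    # Hypothetical sample input, invented for illustration only.
    sample = {0: "MPI_Allreduce", 1: "MPI_Allreduce",
              2: "MPI_Recv", 3: "MPI_Allreduce"}
    majority, outliers = find_outliers(sample)
    print("most ranks are in:", majority)
    for frame, ranks in outliers.items():
        print("ranks", ranks, "are in:", frame)
```

With a few thousand ranks the same grouping usually collapses the job to a handful of distinct stacks, which is far easier to read than one trace per process.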
> So it looks like the process is waiting for the appropriate posting on
> the internal scoreboard, and just hanging in a tight loop until this
> actually happens.
> But these hangs usually happen at the same place each time for a logic
Like in allreduce you mean?
> But the odd thing about this code is that it worked fine 12 - 18 months
> ago, and we haven't touched it since (nor has it changed). What has
> changed is that we are now using OpenMPI 1.2.6.
The other important thing to know here is what you have changed *from*.
> So the code hasn't changed, and the OS on which it runs hasn't changed,
> but the MPI stack has. Yeah, that's a clue.
> Turning off openib and tcp doesn't make a great deal of impact. This is
> also a clue.
So it's likely algorithmic? You could turn off shared memory as well
but it won't make a great deal of impact so there isn't any point.
> I am looking now to trying mvapich2 and seeing how that goes. Using
> Intel and gfortran compilers (Fortran/C mixed code).
> Anyone see strange things like this with their MPI stacks?
All the time. It's not really strange, just what happens on large
systems, especially when developing MPI or applications.
> I'll try all the usual things (reduce the optimization level, etc).
> Sage words of advice (and clue sticks) welcome.
Is it the application which hangs or a combination of the application
and the dataset you give it? What's the smallest process count and
timescale you can reproduce this on?
You could try valgrind, which works well with OpenMPI. It will help you
with memory corruption but won't be of much help if you have a race
condition. Going by reputation, Marmot might be of some use: it'll point
out if you are doing anything silly with MPI calls. There is enough
flexibility in the standard that you can do something completely illegal
but have it work in 90% of cases, and Marmot should pick up on these.
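One commonly cited pattern of this "works most of the time" kind is two processes that each do a blocking send before their receive. That only completes if the library happens to buffer the message, which the MPI standard does not guarantee (it calls such programs "unsafe"). The sketch below is a toy model in Python threads, not MPI, and it is not a claim about what is wrong with the code in question. The `Channel` class, the `exchange` function and the 1024-byte eager limit are all invented for the example; real limits depend on the implementation and the transport.

```python
import queue
import threading


class Channel:
    """One-directional message channel into a rank (toy model, not MPI)."""

    def __init__(self, eager_limit):
        self.eager_limit = eager_limit
        self.data = queue.Queue()
        self.recv_posted = threading.Event()

    def send(self, payload, timeout):
        # Small messages are buffered and the send returns at once.
        # Large ones wait until the receiver has posted its receive.
        if len(payload) > self.eager_limit:
            if not self.recv_posted.wait(timeout):
                return False  # a real run would block here forever
        self.data.put(payload)
        return True

    def recv(self, timeout):
        self.recv_posted.set()
        return self.data.get(timeout=timeout)


def exchange(size, eager_limit=1024, timeout=1.0):
    """Both ranks send first, then receive. Returns a status per rank."""
    inbox = [Channel(eager_limit), Channel(eager_limit)]
    status = {}

    def rank_main(me):
        peer = 1 - me
        if not inbox[peer].send(b"x" * size, timeout):
            status[me] = "blocked in send"
            return
        inbox[me].recv(timeout)
        status[me] = "ok"

    threads = [threading.Thread(target=rank_main, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


if __name__ == "__main__":
    print("16 bytes:  ", exchange(16))
    print("4096 bytes:", exchange(4096, timeout=0.2))
```

The same source code passes or hangs depending only on message size and buffer limits, which is one way a program can appear to break when nothing but the MPI stack underneath it has changed.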
We could take this off-line if you prefer, this could potentially get