[Beowulf] An annoying MPI problem

Bruno Coutinho coutinho at dcc.ufmg.br
Wed Jul 9 15:58:35 PDT 2008


Try disabling shared memory only.
Open MPI's shared memory buffer is limited, and it can deadlock if you
overflow it.
Because Open MPI busy-waits, the deadlock shows up as a livelock
(processes spinning at 100% CPU).
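In Open MPI 1.2.x the shared-memory transport can be excluded with an MCA
parameter, along these lines (the process count and executable name are
placeholders):

```
mpirun -np 4 --mca btl ^sm ./app
```

The `^` negates the list, so Open MPI keeps its other transports (openib,
tcp) and drops only the sm BTL.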


2008/7/9 Ashley Pittman <apittman at concurrent-thinking.com>:

> On Tue, 2008-07-08 at 22:01 -0400, Joe Landman wrote:
> >    Short version:  The code starts and runs.  Reads in its data.  Starts
> > its iterations.  And then somewhere after this, it hangs.  But not
> > always at the same place.  It doesn't write state data back out to the
> > disk, just logs.  Rerunning it gets it to a different point, sometimes
> > hanging sooner, sometimes later.  Seems to be the case on multiple
> > different machines, with different OSes.  Working on comparing MPI
> > distributions, and it hangs with IB as well as with shared memory and
> > tcp sockets.
>
> Sounds like you've found a bug, doesn't sound too difficult to find,
> comments in-line.
>
> >    Right now we are using OpenMPI 1.2.6, and this code does use
> > allreduce.  When it hangs, an strace of the master process shows lots of
> > polling:
>
> Why do you mention allreduce, does it tend to be in allreduce when it
> hangs?  Is it happening at the same place but on a different iteration
> every time perhaps?  This is quite important, you could either have a
> "random" memory corruption which can cause the program to stop anywhere
> and are often hard to find or a race condition which is easier to deal
> with, if there are any similarities in the stack then it tends to point
> to the latter.
>
> allreduce is one of the collective functions with an implicit barrier
> which means that *no* process can return from it until *all* processes
> have called it.  If your program uses allreduce extensively it's entirely
> possible that one process has stopped for whatever reason and the rest
> have continued as far as they can until they too deadlock.  Collectives
> often get accused of causing programs to hang when in reality N-1
> processes are in the collective call and 1 is off somewhere else.
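The N-1-in-the-collective pattern described above can be illustrated
outside MPI with a plain-Python sketch: threads stand in for ranks, a
`threading.Barrier` stands in for allreduce's implicit barrier, and a
timeout is added only so the demo terminates instead of hanging like the
real job would.

```python
import threading

NPROCS = 4
# Stand-in for the implicit barrier inside a collective call.
barrier = threading.Barrier(NPROCS)
results = {}

def rank(r):
    if r == NPROCS - 1:
        # This "process" is off somewhere else and never reaches the
        # collective -- the lone straggler.
        results[r] = "never called allreduce"
        return
    try:
        # The other N-1 ranks block here waiting for everyone.
        # A timeout keeps the demo finite; a real MPI job spins forever.
        barrier.wait(timeout=1.0)
        results[r] = "returned"
    except threading.BrokenBarrierError:
        results[r] = "stuck in collective"

threads = [threading.Thread(target=rank, args=(r,)) for r in range(NPROCS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Every rank that did reach the barrier ends up "stuck in collective", even
though only one rank is actually at fault.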
>
> > c1-1:~ # strace -p 8548
>
> > [spin forever]
>
> Any chance of a stack trace, preferably a parallel one?  I assume *all*
> processes in the job are in the R state?  Do you have a mechanism
> available to allow you to see the message queues?
>
> > So it looks like the process is waiting for the appropriate posting on
> > the internal scoreboard, and just hanging in a tight loop until this
> > actually happens.
> >
> > But these hangs usually happen at the same place each time for a logic
> > error.
>
> Like in allreduce you mean?
>
> > But the odd thing about this code is that it worked fine 12 - 18 months
> > ago, and we haven't touched it since (nor has it changed).  What has
> > changed is that we are now using OpenMPI 1.2.6.
>
> The other important thing to know here is what you have changed *from*.
>
> > So the code hasn't changed, and the OS on which it runs hasn't changed,
> but the MPI stack has.  Yeah, that's a clue.
>
> > Turning off openib and tcp doesn't make a great deal of impact.  This is
> > also a clue.
>
> So it's likely algorithmic?  You could turn off shared memory as well
> but it won't make a great deal of impact so there isn't any point.
>
> > I am looking now to trying mvapich2 and seeing how that goes.  Using
> > Intel and gfortran compilers (Fortran/C mixed code).
> >
> > Anyone see strange things like this with their MPI stacks?
>
> All the time, it's not really strange, just what happens on large
> systems, especially when developing MPI or applications.
>
> > I'll try all the usual things (reduce the optimization level, etc).
> > Sage words of advice (and clue sticks) welcome.
>
> Is it the application which hangs or a combination of the application
> and the dataset you give it?  What's the smallest process count and
> timescale you can reproduce this on?
>
> You could try valgrind which works well with openmpi, it will help you
> with memory corruption but won't be of much help if you have a race
> condition.  Going by reputation Marmot might be of some use, it'll point
> out if you are doing anything silly with MPI calls, there is enough
> flexibility in the standard that you can do something completely illegal
> but have it work in 90% of cases; Marmot should pick up on these.
> http://www.hlrs.de/organization/amt/projects/marmot/
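The valgrind suggestion can be tried with a command roughly like the
following (illustrative only: the process count and `./app` are
placeholders, and the exact flags worth using depend on your valgrind
version):

```
mpirun -np 4 valgrind --quiet --error-exitcode=1 ./app
```

Running one valgrind instance per rank is slow but is often the quickest
way to catch the "random" memory corruption case Ashley describes.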
>
> We could take this off-line if you prefer, this could potentially get
> quite involved...
>
> Ashley Pittman.
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>