[Beowulf] An annoying MPI problem

Joe Landman landman at scalableinformatics.com
Tue Jul 8 19:01:48 PDT 2008


Hi folks

   Dealing with an MPI problem that has me scratching my head.  Quite 
beowulfish, as that's where this code runs.

   Short version:  The code starts, runs, reads in its data, and begins 
its iterations.  Then, somewhere after that, it hangs, but not always at 
the same place.  It doesn't write state data back out to disk, just 
logs.  Rerunning it gets it to a different point, sometimes hanging 
sooner, sometimes later.  The same thing happens on multiple machines 
with different OSes.  I am working on comparing MPI distributions, and 
it hangs with IB as well as with shared memory and TCP sockets.

   Right now we are using OpenMPI 1.2.6, and this code does use 
allreduce.  When it hangs, an strace of the master process shows lots of 
polling:


c1-1:~ # strace -p 8548
Process 8548 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b061f65c9b2, [CHLD], SA_RESTORER|SA_RESTART, 
0x2b062049b130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, 
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, 
events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b061f65c9b2, [CHLD], SA_RESTORER|SA_RESTART, 
0x2b062049b130}, NULL, 8) = 0

[spin forever]
...

So it looks like the process is waiting for the appropriate posting on 
the internal scoreboard, and spinning in a tight poll loop until that 
actually happens.

But hangs caused by a logic error usually happen at the same place 
every run; this one moves around.

This is what I have seen in the past from other MPI codes where the 
sends and receives all match up in count, but every rank posts its 
blocking send before its receive ... ordering is important, of course 
(a sketch of that pattern is below).
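
Hypothetical example, not from this code: every rank does a blocking 
MPI_Send to its neighbor before posting the matching MPI_Recv.  Small 
messages sneak through via eager buffering, but once a message crosses 
the rendezvous threshold everyone blocks in MPI_Send and nobody reaches 
the receive:

/* deadlock_sketch.c -- minimal sketch of the send-before-receive
 * pattern, not taken from the application in question.
 * Compile with -DSAFE (assumed switch) to use the non-deadlocking
 * MPI_Sendrecv variant instead.
 */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)   /* large enough to exceed a typical eager limit */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;
    double *sbuf = calloc(N, sizeof(double));
    double *rbuf = calloc(N, sizeof(double));

#ifdef SAFE
    /* One safe alternative: let MPI pair the exchange up for you. */
    MPI_Sendrecv(sbuf, N, MPI_DOUBLE, right, 0,
                 rbuf, N, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#else
    /* Deadlock-prone ordering: every rank sends first. */
    MPI_Send(sbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
#endif

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}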

But the odd thing about this code is that it worked fine 12 - 18 months 
ago, and we haven't touched it since (nor has it changed).  What has 
changed is that we are now using OpenMPI 1.2.6.

So the code hasn't changed, and the OS on which it runs hasn't changed, 
but the MPI stack has.  Yeah, that's a clue.

Turning off openib and tcp doesn't have much of an impact.  This is 
also a clue.

I am now looking at trying mvapich2 and seeing how that goes.  We are 
using the Intel and gfortran compilers (mixed Fortran/C code).

Has anyone seen strange things like this with their MPI stacks?  OpenMPI? 
Mvapich2?  I should try Intel MPI as well (a rebuilt mvapich2, as I 
remember).

I'll try all the usual things (reduce the optimization level, etc.). 
Sage words of advice (and clue sticks) welcome.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615


