[Beowulf] MPICH-1.2.5 hangs on 16 node cluster

Sreenivasulu Pulichintala sreenivasulu at topspin.com
Fri Nov 19 01:07:18 PST 2004


Hi,

 

I am seeing strange behavior from the MPICH stack when running on a 16-node
cluster: the job deadlocks and hangs. Attaching gdb to the hung processes
shows the following stacks.
-------------------------------------
On machine 1
--------------
-- Process 1 --
#0  0x0000000041efb858 in poll_rdma_buffer ()
#1  0x0000000041efd2cb in viutil_spinandwaitcq ()
#2  0x0000000041efba1e in MPID_DeviceCheck ()
#3  0x0000000041f0a36b in MPID_RecvComplete ()
#4  0x0000000041f02fc4 in MPI_Waitall ()
#5  0x0000000041eee8fc in MPI_Sendrecv ()
#6  0x0000000041ef460e in intra_Allreduce ()
#7  0x0000000041eec62c in MPI_Allreduce ()
#8  0x0000000041eeeab9 in mpi_allreduce_ ()
#9  0x00000000401ae133 in dynai_ ()
#10 0x0000000040006d08 in frame_dummy ()
-- Process 2 --
#0  0x0000000041f018f8 in smpi_net_lookup ()
#1  0x0000000041f0188b in MPID_SMP_Check_incoming ()
#2  0x0000000041efd2b6 in viutil_spinandwaitcq ()
#3  0x0000000041efba1e in MPID_DeviceCheck ()
#4  0x0000000041f0a36b in MPID_RecvComplete ()
#5  0x0000000041f02fc4 in MPI_Waitall ()
#6  0x0000000041eee8fc in MPI_Sendrecv ()
#7  0x0000000041ef460e in intra_Allreduce ()
#8  0x0000000041eec62c in MPI_Allreduce ()
#9  0x0000000041eeeab9 in mpi_allreduce_ ()
#10 0x00000000401ae133 in dynai_ ()
#11 0x0000000040006d08 in frame_dummy ()
 
------
 
On machine 2
--------------
-- Process 1 --
#0  0x0000000041efb877 in poll_rdma_buffer ()
#1  0x0000000041efd2cb in viutil_spinandwaitcq ()
#2  0x0000000041efba1e in MPID_DeviceCheck ()
#3  0x0000000041f0a36b in MPID_RecvComplete ()
#4  0x0000000041f09ead in MPID_RecvDatatype ()
#5  0x0000000041f03569 in MPI_Recv ()
#6  0x0000000041eef42d in mpi_recv_ ()
#7  0x0000000041c0b153 in remdupslave_ ()
#8  0x000000000000cf6b in ?? ()
#9  0x000000000000c087 in ?? ()
#10 0x000000000002f4b4 in ?? ()
#11 0x000000000000c503 in ?? ()
#12 0x000000000000c575 in ?? ()
#13 0x000000000000040c in ?? ()
#14 0x00000000401ae313 in dynai_ ()
#15 0x0000000040006d08 in frame_dummy ()
-- Process 2 --
#0  0x0000000041efd2cb in viutil_spinandwaitcq ()
#1  0x0000000041efba1e in MPID_DeviceCheck ()
#2  0x0000000041f0a36b in MPID_RecvComplete ()
#3  0x0000000041f02fc4 in MPI_Waitall ()
#4  0x0000000041eee8fc in MPI_Sendrecv ()
#5  0x0000000041ef460e in intra_Allreduce ()
#6  0x0000000041eec62c in MPI_Allreduce ()
#7  0x0000000041eeeab9 in mpi_allreduce_ ()
#8  0x00000000401ae133 in dynai_ ()
#9  0x0000000040006d08 in frame_dummy ()
-----
 

The processes on the other machines appear to be stuck in the MPI_Allreduce
stack as well.
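
One way to spot a process that is out of step with the rest is to reduce each
backtrace to its outermost MPI call and count how many processes sit in each.
The following is only a minimal sketch in Python, not part of MPICH or gdb:
the names outermost_mpi_call and summarize are hypothetical, and it assumes
frames in the plain gdb format shown above, with C-level entry points named
MPI_* (the Fortran wrappers such as mpi_allreduce_ are ignored).

```python
import re
from collections import Counter

# Matches gdb frame lines such as "#4  0x0000000041f02fc4 in MPI_Waitall ()".
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)\s+\(")


def outermost_mpi_call(backtrace):
    """Return the outermost C-level MPI_* frame in one backtrace, or None.

    Assumption: frames are listed innermost first (frame #0 at the top),
    so the last MPI_* frame seen is the call the application made.
    """
    outermost = None
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group(1).startswith("MPI_"):
            outermost = match.group(1)
    return outermost


def summarize(backtraces):
    """Count processes per outermost MPI call.

    `backtraces` maps a process label (any string) to its backtrace text.
    """
    return Counter(outermost_mpi_call(bt) for bt in backtraces.values())


if __name__ == "__main__":
    # Abbreviated frames copied from two of the stacks above.
    traces = {
        "machine 1, process 1": (
            "#4  0x0000000041f02fc4 in MPI_Waitall ()\n"
            "#5  0x0000000041eee8fc in MPI_Sendrecv ()\n"
            "#7  0x0000000041eec62c in MPI_Allreduce ()\n"
        ),
        "machine 2, process 1": (
            "#4  0x0000000041f09ead in MPID_RecvDatatype ()\n"
            "#5  0x0000000041f03569 in MPI_Recv ()\n"
            "#6  0x0000000041eef42d in mpi_recv_ ()\n"
        ),
    }
    for call, count in summarize(traces).items():
        print(call, count)
```

Applied to the four stacks above, this would report three processes in
MPI_Allreduce and one (the first process on the second machine) in MPI_Recv,
called from remdupslave_.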

 

Has anyone experienced a similar problem?

 

Any help in this regard is highly appreciated.

 

Thanks

Sree
