[Beowulf] MPICH-1.2.5 hangs on 16 node cluster
Sreenivasulu Pulichintala
sreenivasulu at topspin.com
Fri Nov 19 01:07:18 PST 2004
Hi,
I am seeing strange behavior from the MPICH stack when running on a 16-node
cluster: the job deadlocks and hangs. Attaching to the processes with gdb
shows the following stacks.
-------------------------------------
On node 1, process 1
--------------------
#0 0x0000000041efb858 in poll_rdma_buffer ()
#1 0x0000000041efd2cb in viutil_spinandwaitcq ()
#2 0x0000000041efba1e in MPID_DeviceCheck ()
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#4 0x0000000041f02fc4 in MPI_Waitall ()
#5 0x0000000041eee8fc in MPI_Sendrecv ()
#6 0x0000000041ef460e in intra_Allreduce ()
#7 0x0000000041eec62c in MPI_Allreduce ()
#8 0x0000000041eeeab9 in mpi_allreduce_ ()
#9 0x00000000401ae133 in dynai_ ()
#10 0x0000000040006d08 in frame_dummy ()
------
On node 1, process 2
--------------------
#0 0x0000000041f018f8 in smpi_net_lookup ()
#1 0x0000000041f0188b in MPID_SMP_Check_incoming ()
#2 0x0000000041efd2b6 in viutil_spinandwaitcq ()
#3 0x0000000041efba1e in MPID_DeviceCheck ()
#4 0x0000000041f0a36b in MPID_RecvComplete ()
#5 0x0000000041f02fc4 in MPI_Waitall ()
#6 0x0000000041eee8fc in MPI_Sendrecv ()
#7 0x0000000041ef460e in intra_Allreduce ()
#8 0x0000000041eec62c in MPI_Allreduce ()
#9 0x0000000041eeeab9 in mpi_allreduce_ ()
#10 0x00000000401ae133 in dynai_ ()
#11 0x0000000040006d08 in frame_dummy ()
------
On node 2, process 1
--------------------
#0 0x0000000041efb877 in poll_rdma_buffer ()
#1 0x0000000041efd2cb in viutil_spinandwaitcq ()
#2 0x0000000041efba1e in MPID_DeviceCheck ()
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#4 0x0000000041f09ead in MPID_RecvDatatype ()
#5 0x0000000041f03569 in MPI_Recv ()
#6 0x0000000041eef42d in mpi_recv_ ()
#7 0x0000000041c0b153 in remdupslave_ ()
#8 0x000000000000cf6b in ?? ()
#9 0x000000000000c087 in ?? ()
#10 0x000000000002f4b4 in ?? ()
#11 0x000000000000c503 in ?? ()
#12 0x000000000000c575 in ?? ()
#13 0x000000000000040c in ?? ()
#14 0x00000000401ae313 in dynai_ ()
#15 0x0000000040006d08 in frame_dummy ()
------
On node 2, process 2
--------------------
#0 0x0000000041efd2cb in viutil_spinandwaitcq ()
#1 0x0000000041efba1e in MPID_DeviceCheck ()
#2 0x0000000041f0a36b in MPID_RecvComplete ()
#3 0x0000000041f02fc4 in MPI_Waitall ()
#4 0x0000000041eee8fc in MPI_Sendrecv ()
#5 0x0000000041ef460e in intra_Allreduce ()
#6 0x0000000041eec62c in MPI_Allreduce ()
#7 0x0000000041eeeab9 in mpi_allreduce_ ()
#8 0x00000000401ae133 in dynai_ ()
#9 0x0000000040006d08 in frame_dummy ()
-----
The processes on the other machines all appear to be stuck in similar
MPI_Allreduce stacks, while the first process on node 2 is blocked in
MPI_Recv instead.
Has anyone experienced this kind of problem?
Any help in this regard is highly appreciated.
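For reference, the stacks above look like the classic situation where one
rank blocks in a point-to-point receive for a message that never arrives and
therefore never reaches the collective, so every other rank spins in the
progress engine inside MPI_Allreduce. A minimal sketch that reproduces this
pattern (hypothetical code, not our actual application; it hangs by design
when run on two or more ranks):

```c
/* Sketch of the suspected deadlock pattern: rank 1 blocks in MPI_Recv
 * for a message no one sends, so it never enters the collective; all
 * other ranks block in MPI_Allreduce waiting for rank 1.
 * Build with: mpicc deadlock.c -o deadlock; run with: mpirun -np 16 ./deadlock
 * WARNING: this program hangs by design on 2+ ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, val = 1, sum, buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 1 && size > 1) {
        /* This receive is never matched by a send, so rank 1 blocks
         * here forever (the MPI_Recv stack seen on node 2). */
        MPI_Recv(&buf, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
    }

    /* Everyone else reaches the collective and spins in the device
     * progress loop (MPID_DeviceCheck / viutil_spinandwaitcq)
     * waiting for rank 1, which never arrives. */
    MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d done\n", rank);
    MPI_Finalize();
    return 0;
}
```

If your application shows the same asymmetry, the first thing to check is
whether every rank actually calls MPI_Allreduce the same number of times on
the same communicator.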
Thanks
Sree