[Beowulf] MPICH-1.2.5 hangs on 16 node cluster


Greg Lindahl lindahl at pathscale.com
Sun Nov 21 10:10:43 PST 2004


On Fri, Nov 19, 2004 at 02:37:18PM +0530, Sreenivasulu Pulichintala wrote:

> I see some strange behavior of the MPICH stack when running on a 16 node
> cluster.

Is this stock MPICH? If not, you haven't included very much info about
what you're actually running. In any case:

> On node 2
> --------
> #0  0x0000000041efb877 in poll_rdma_buffer ()
> #1  0x0000000041efd2cb in viutil_spinandwaitcq ()
> #2  0x0000000041efba1e in MPID_DeviceCheck ()
> #3  0x0000000041f0a36b in MPID_RecvComplete ()
> #4  0x0000000041f09ead in MPID_RecvDatatype ()
> #5  0x0000000041f03569 in MPI_Recv ()
> #6  0x0000000041eef42d in mpi_recv_ ()
> #7  0x0000000041c0b153 in remdupslave_ ()
> #8  0x000000000000cf6b in ?? ()
> #9  0x000000000000c087 in ?? ()
> #10 0x000000000002f4b4 in ?? ()
> #11 0x000000000000c503 in ?? ()
> #12 0x000000000000c575 in ?? ()
> #13 0x000000000000040c in ?? ()
> #14 0x00000000401ae313 in dynai_ ()
> #15 0x0000000040006d08 in frame_dummy ()

This process appears to be in a Fortran mpi_recv() call and NOT in
All_Reduce. That could be a bug in your program. Then again, it
isn't clear that this stack trace is intact; it may be corrupt.
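
With 16 nodes, reading each backtrace by eye gets tedious. Below is a rough sketch, not part of MPICH or gdb, that scans a gdb backtrace for the first frame whose name starts with MPI_, which is the call the process is blocked in. The helper names (parse_frames, blocking_mpi_call, unresolved_frames) are made up for illustration. Counting the "??" frames as a sign that the trace may be unreliable is an assumption on my part, not a rule.

```python
import re

# Matches gdb backtrace lines such as "#5  0x0000000041f03569 in MPI_Recv ()".
# A leading "> " quote prefix is tolerated because search() scans the whole line.
FRAME_RE = re.compile(r"#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(")


def parse_frames(backtrace):
    """Return (frame_number, function_name) pairs in the order gdb printed them."""
    frames = []
    for line in backtrace.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1)), match.group(2)))
    return frames


def blocking_mpi_call(backtrace):
    """Name of the innermost MPI_* frame, or None if the trace has none."""
    for _, name in parse_frames(backtrace):
        if name.startswith("MPI_"):
            return name
    return None


def unresolved_frames(backtrace):
    """Count frames gdb could not symbolize (shown as '??').

    Assumption: many such frames suggest the trace may not be trustworthy.
    """
    return sum(1 for _, name in parse_frames(backtrace) if name == "??")


# The node 2 backtrace quoted above.
NODE2 = """\
#0  0x0000000041efb877 in poll_rdma_buffer ()
#1  0x0000000041efd2cb in viutil_spinandwaitcq ()
#2  0x0000000041efba1e in MPID_DeviceCheck ()
#3  0x0000000041f0a36b in MPID_RecvComplete ()
#4  0x0000000041f09ead in MPID_RecvDatatype ()
#5  0x0000000041f03569 in MPI_Recv ()
#6  0x0000000041eef42d in mpi_recv_ ()
#7  0x0000000041c0b153 in remdupslave_ ()
#8  0x000000000000cf6b in ?? ()
#9  0x000000000000c087 in ?? ()
#10 0x000000000002f4b4 in ?? ()
#11 0x000000000000c503 in ?? ()
#12 0x000000000000c575 in ?? ()
#13 0x000000000000040c in ?? ()
#14 0x00000000401ae313 in dynai_ ()
#15 0x0000000040006d08 in frame_dummy ()
"""

print(blocking_mpi_call(NODE2))   # which MPI call this rank is sitting in
print(unresolved_frames(NODE2))   # how many frames gdb could not name
```

Run the same check on the trace from every node. If most ranks report the collective and one reports MPI_Recv, look at the code path leading to that receive (remdupslave_ here) for a send that never gets posted.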

-- greg

p.s. It would be better if you posted to mailing lists in plain
text instead of both text and HTML.



