<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 10 (filtered)">
<style>
<!--
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman";}
a:link, span.MsoHyperlink
{color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{color:purple;
text-decoration:underline;}
pre
{margin:0in;
margin-bottom:.0001pt;
font-size:10.0pt;
font-family:"Courier New";}
span.EmailStyle17
{font-family:Arial;
color:windowtext;}
@page Section1
{size:8.5in 11.0in;
margin:1.0in 1.25in 1.0in 1.25in;}
div.Section1
{page:Section1;}
-->
</style>
</head>
<body lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>Hi,</span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'> </span></font></p>
<pre><font size=2 face="Courier New"><span style='font-size:10.0pt'>I am seeing strange behavior from the MPICH stack when running on a 16-node cluster: the job deadlocks and hangs. Attaching gdb to the hung processes shows the following stacks.</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>-------------------------------------</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>On node 1</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>--------------</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#0 0x0000000041efb858 in poll_rdma_buffer ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#1 0x0000000041efd2cb in viutil_spinandwaitcq ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#2 0x0000000041efba1e in MPID_DeviceCheck ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#3 0x0000000041f0a36b in MPID_RecvComplete ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#4 0x0000000041f02fc4 in MPI_Waitall ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#5 0x0000000041eee8fc in MPI_Sendrecv ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#6 0x0000000041ef460e in intra_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#7 0x0000000041eec62c in MPI_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#8 0x0000000041eeeab9 in mpi_allreduce_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#9 0x00000000401ae133 in dynai_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#10 0x0000000040006d08 in frame_dummy ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>--2nd process----</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#0 0x0000000041f018f8 in smpi_net_lookup ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#1 0x0000000041f0188b in MPID_SMP_Check_incoming ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#2 0x0000000041efd2b6 in viutil_spinandwaitcq ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#3 0x0000000041efba1e in MPID_DeviceCheck ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#4 0x0000000041f0a36b in MPID_RecvComplete ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#5 0x0000000041f02fc4 in MPI_Waitall ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#6 0x0000000041eee8fc in MPI_Sendrecv ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#7 0x0000000041ef460e in intra_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#8 0x0000000041eec62c in MPI_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#9 0x0000000041eeeab9 in mpi_allreduce_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#10 0x00000000401ae133 in dynai_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#11 0x0000000040006d08 in frame_dummy ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'> </span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>------</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'> </span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>On node 2</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>--------</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#0 0x0000000041efb877 in poll_rdma_buffer ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#1 0x0000000041efd2cb in viutil_spinandwaitcq ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#2 0x0000000041efba1e in MPID_DeviceCheck ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#3 0x0000000041f0a36b in MPID_RecvComplete ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#4 0x0000000041f09ead in MPID_RecvDatatype ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#5 0x0000000041f03569 in MPI_Recv ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#6 0x0000000041eef42d in mpi_recv_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#7 0x0000000041c0b153 in remdupslave_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#8 0x000000000000cf6b in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#9 0x000000000000c087 in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#10 0x000000000002f4b4 in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#11 0x000000000000c503 in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#12 0x000000000000c575 in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#13 0x000000000000040c in ?? ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#14 0x00000000401ae313 in dynai_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#15 0x0000000040006d08 in frame_dummy ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>--2nd process----</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#0 0x0000000041efd2cb in viutil_spinandwaitcq ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#1 0x0000000041efba1e in MPID_DeviceCheck ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#2 0x0000000041f0a36b in MPID_RecvComplete ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#3 0x0000000041f02fc4 in MPI_Waitall ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#4 0x0000000041eee8fc in MPI_Sendrecv ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#5 0x0000000041ef460e in intra_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#6 0x0000000041eec62c in MPI_Allreduce ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#7 0x0000000041eeeab9 in mpi_allreduce_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#8 0x00000000401ae133 in dynai_ ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>#9 0x0000000040006d08 in frame_dummy ()</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'>-----</span></font></pre><pre><font
size=2 face="Courier New"><span style='font-size:10.0pt'> </span></font></pre>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>The processes on the other machines all seem to be stuck in the same MPI_Allreduce stack.</span></font></p>
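<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'> </span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>To compare the traces more easily, here is a minimal Python sketch (not part of the application) that reports the outermost MPI_* frame in each backtrace. It assumes the gdb output has been saved as text. The helper name blocking_mpi_call, the dictionary labels, and the abbreviated traces are only illustrative.</span></font></p>
<pre>
```python
import re

# Frame lines as printed by gdb in the traces above, e.g.
# "#5 0x0000000041f03569 in MPI_Recv ()".
FRAME = re.compile(r"#\d+\s+0x[0-9a-f]+ in (\S+) \(\)")


def blocking_mpi_call(backtrace):
    """Return the outermost MPI_* frame, i.e. the call the application made."""
    names = FRAME.findall(backtrace)
    # Internal MPID_* device frames do not start with "MPI_" and are skipped.
    mpi_frames = [name for name in names if name.startswith("MPI_")]
    # gdb lists the innermost frame first, so the last match is the outermost.
    return mpi_frames[-1] if mpi_frames else None


# Abbreviated copies of two of the traces above; the labels are illustrative.
traces = {
    "node 1, process 1": """
#4 0x0000000041f02fc4 in MPI_Waitall ()
#5 0x0000000041eee8fc in MPI_Sendrecv ()
#7 0x0000000041eec62c in MPI_Allreduce ()
#9 0x00000000401ae133 in dynai_ ()
""",
    "node 2, process 1": """
#3 0x0000000041f0a36b in MPID_RecvComplete ()
#5 0x0000000041f03569 in MPI_Recv ()
#7 0x0000000041c0b153 in remdupslave_ ()
""",
}

summary = {label: blocking_mpi_call(text) for label, text in traces.items()}
for label, call in summary.items():
    print(label, "is blocked in", call)
```
</pre>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>On the traces above it reports MPI_Allreduce for the first process on node 1 and MPI_Recv for the first process on node 2. That process, which is in MPI_Recv called from remdupslave_, is the only one shown that is not inside the Allreduce.</span></font></p>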
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'> </span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>Has anyone experienced a similar problem?</span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'> </span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>Any help in this regard is highly appreciated.</span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'> </span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>Thanks</span></font></p>
<p class=MsoNormal><font size=3 face="Times New Roman"><span style='font-size:
12.0pt'>Sree</span></font></p>
</div>
</body>
</html>