[Beowulf] Explanation of error message in MPICH-1.2.7
Kevin Ball
kball at pathscale.com
Fri Oct 6 13:46:54 PDT 2006
Hi Jeff,
On Fri, 2006-10-06 at 12:50, Jeffrey B. Layton wrote:
> Afternoon cluster fans,
>
> I'm working with a CFD code using the PGI 6.1 compilers and
> MPICH-1.2.7. The code runs fine for a while but I get an error
> message that I've never seen before:
>
>
> [2] MPI Internal Aborting program Deep nest in Check_incoming
> [2] Deep nest in Check_incoming
I've seen this before, and I recall tracking it down to a bug in
MPICH. Grepping through the source, I found this in
mpid/ch_p4/chckdev.c:

    /* There is an implementation bug in the flow control code that
       can lead to an infinite nest of calls to this routine.
       Rather than allow the code to hang, we abort if the nesting
       level gets too deep */
    if (nest_level++ > MAX_CHECKDEVICE_NEST) {
        MPID_Abort( 0, 1, "MPI Internal", "Deep nest in Check_incoming" );
    }

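For anyone who hasn't run into this pattern, here is a minimal standalone sketch of that kind of nesting guard. It is not MPICH's code: check_incoming(), MAX_NEST, and the "pending" argument are made-up names for illustration, and it returns -1 where MPICH would call MPID_Abort.

```c
#include <stdio.h>

#define MAX_NEST 8            /* hypothetical limit, not MPICH's value */

static int nest_level = 0;    /* current re-entrancy depth */

/* Hypothetical stand-in for a routine that can re-enter itself while
   draining pending work. Returns 0 on success, or -1 once the nesting
   limit is exceeded (the point where MPICH would abort instead). */
int check_incoming(int pending)
{
    int rc = 0;

    /* Same shape as the MPICH test: compare, then bump the counter. */
    if (nest_level++ > MAX_NEST) {
        fprintf(stderr, "Deep nest in check_incoming\n");
        nest_level--;
        return -1;
    }

    if (pending > 0)
        rc = check_incoming(pending - 1);   /* the nested call */

    nest_level--;             /* unwind the depth on the way out */
    return rc;
}
```

With these made-up numbers, check_incoming(8) nests nine calls deep and returns 0, while check_incoming(9) trips the guard on the tenth call and returns -1. The point is that the guard turns a would-be hang into a visible failure; it does nothing about whatever keeps driving the routine back into itself.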
Sorry that this isn't helpful in fixing things.
-Kevin
>
> This error message is in the error file from PBS. The output from
> the code gives the following:
>
>
> p2_15458: p4_error: : 1
> p5_21530: p4_error: net_recv read: probable EOF on socket: 1
> p7_21548: p4_error: net_recv read: probable EOF on socket: 1
> p6_21539: p4_error: net_recv read: probable EOF on socket: 1
> rm_l_6_21544: (95.492188) net_send: could not write to fd=5, errno = 32
> rm_l_2_15464: (95.835938) net_send: could not write to fd=5, errno = 32
> rm_l_5_21535: (95.574219) net_send: could not write to fd=5, errno = 32
> rm_l_7_21553: (95.410156) net_send: could not write to fd=5, errno = 32
>
>
> The code runs fine with other MPI implementations (Scali,
> MVAPICH, etc.) My googling efforts haven't yielded anything.
> Does anyone have any input on this?
>
> Thanks!
>
> Jeff