[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes
siegert at sfu.ca
Mon Nov 16 13:01:02 PST 2009
On Sun, Nov 15, 2009 at 02:29:13PM -0800, Michael Di Domenico wrote:
> You might want to ask on the linux-rdma list (formerly openfabrics). It's
> been a while since I looked at IB error messages, but what
> stack/version are you running?
This is under Scientific Linux 5.3, a RHEL 5.3 clone that ships
with OFED-1.3.2, which admittedly is quite old. Unfortunately,
upgrading it is a major forklift job, so I must be sure that this is
really the problem. I'll do a few tests on a couple of nodes ...
> On Sat, Nov 14, 2009 at 4:43 PM, Martin Siegert <siegert at sfu.ca> wrote:
> > Hi,
> > I am running into problems when sending large messages (about
> > 180000000 doubles) over IB. A fairly trivial example program is attached.
> > # mpicc -g sendrecv.c
> > # mpiexec -machinefile m2 -n 2 ./a.out
> > id=1: calling irecv ...
> > id=0: calling isend ...
> > [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813 vendor error 105 qp_idx 3
> > This is with OpenMPI-1.3.3.
> > Does anybody know a solution to this problem?
> > If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
> > and never returns.
> > I asked on the openmpi users list but got no response ...
> > Cheers,
> > Martin
> > --
> > Martin Siegert
> > Head, Research Computing
> > WestGrid Site Lead
> > IT Services phone: 778 782-4691
> > Simon Fraser University fax: 778 782-4242
> > Burnaby, British Columbia email: siegert at sfu.ca
> > Canada V5A 1S6