[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Mon Nov 16 19:40:51 PST 2009

Hi Martin

I tried your program with the four combinations of
IB and TCP/IP, mcmodel small and medium.
I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
just the program, hence this is not a very clean test.

FYI, we have dual-socket quad-core AMD Opteron
nodes with 16GB RAM each.
OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.

When I ran on 2 nodes and 16 processes the program would always fail
with segmentation fault / address not mapped on all four
combinations above.

However, when I ran on 2 nodes and 2 processes (using the -bynode flag
to place each process on a separate node), it
worked with all four combinations!

Here is the IB+medium stderr (you printed to stderr):
id=1: calling irecv ...
id=0: calling isend ...

and the corresponding stdout:
...
id=0: isend/irecv completed 1.954140
id=1: isend/irecv completed 4.192037

This rules out a problem with the memory model, I suppose.
Small is good enough for your message size,
as long as there is enough RAM for all processes,
MPI overhead, etc.

Also, as Don Holmgren already pointed out to you,
make sure your limits are properly set on the nodes.
For instance, we use Torque, and we put these settings
on the nodes' /etc/init.d/pbs_mom:

ulimit -n 32768
ulimit -s unlimited
ulimit -l unlimited

Just like Don, we've been burned by this before, when using the
vendor's original setup.
Of course these limits can be set in other ways.

As a practical matter:

Would it be possible/desirable to reduce the message size,
splitting the huge message into several smaller ones?
I know the wisdom is that one big message is better
than many small ones, but here we're talking about huge,
not big, and sizable, not small.

Even your tiny test program takes a detectable time to run
(4+ seconds on IB, 14+ seconds on TCP/IP).
It may be worth writing another version of it that loops over
smaller messages,
and doing some timing tests to compare with the huge
message version.
There may be a sweet spot for the message size vs. number of
messages, I would guess.
Big may not always be better.

In the past a user here had a program sending very large messages
(big 3D arrays).
Not so big as to hit the 2GB threshold, but big enough to
slow down the nodes and the cluster.
Rewriting the program to loop over smaller messages
(2D array slices) solved the problem.
I remember other threads in the MPICH and OpenMPI
mailing lists that reported difficulties with huge messages.

My $0.02
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

Martin Siegert wrote:
> Hi,
> 
> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>> Hi Martin
>>
>> We didn't know which compiler you used.
>> So what Michael sent you ("mmodel=memory_model")
>> is the Intel compiler flag syntax.
>> (PGI uses the same syntax, IIRR.)
> 
> Now that was really stupid, I am using gcc-4.3.2 and even looked up
> the correct syntax for the memory model, but nevertheless pasted the
> Intel syntax into my configure script ... sorry.
> 
>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>> I only used this with Intel ifort, hence I am not sure,
>> but "medium" should work fine for large data/not-so-large program
>> in gcc/gfortran.
>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>> anyway.
>> (Maybe it is there in newer gcc versions.)
>> The darn thing is that gcc says "medium" doesn't support building
>> shared libraries,
>> hence you may need to build OpenMPI static libraries instead,
>> I would guess.
>> (Again, check this if you have a newer gcc version.)
>> Here's an excerpt of my gcc (4.1.2) man page:
>>
>>
>>        -mcmodel=small
>>            Generate code for the small code model: the program and its
>>            symbols must be linked in the lower 2 GB of the address space.
>>            Pointers are 64 bits.  Programs can be statically or
>>            dynamically linked.  This is the default code model.
>>
>>        -mcmodel=kernel
>>            Generate code for the kernel code model.  The kernel runs in
>>            the negative 2 GB of the address space.  This model has to be
>>            used for Linux kernel code.
>>
>>        -mcmodel=medium
>>            Generate code for the medium model: The program is linked in
>>            the lower 2 GB of the address space but symbols can be located
>>            anywhere in the address space.  Programs can be statically or
>>            dynamically linked, but building of shared libraries are not
>>            supported with the medium model.
>>
>>        -mcmodel=large
>>            Generate code for the large model: This model makes no
>>            assumptions about addresses and sizes of sections.  Currently
>>            GCC does not implement this model.
> 
> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
> still fails. The error message changes, however:
> 
> id=1: calling irecv ...
> id=0: calling isend ...
> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
> 
> (strerror(112) is "Host is down", which is certainly not correct).
> This now points to system libraries - libmlx4. Am I correct in assuming that
> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
> libraries without checking?
> 
>> If you are using OpenMPI, "ompi-info -config"
>> will tell the flags used to compile it.
>> Mine is 1.3.2 and has no explicit mcmodel flag,
>> which according to the gcc man page should default to "small".
> 
> Are you - in fact, is anybody - able to run my test program? I am
> hoping that there is some stupid misconfiguration on the cluster
> that can be fixed easily, without reinstalling/recompiling all
> apps ...
> 
>> Are you using 16GB per process or for the whole set of processes?
> 
> I am running the two processes on different nodes (and nothing else
> on the nodes), thus each process has the full 16GB available.
> 
>> I hope this helps,
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
> 
> Thanks!
> 
> - Martin
> 
>> Martin Siegert wrote:
>>> Hi Michael,
>>>
>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>> Martin,
>>>>
>>>> Could it be that your MPI library was compiled using a small memory 
>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>> addressing limit.
>>>>
>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>> size question."
>>>>
>>>>
>>>> Mike
>>> I am running MPI applications that use more than 16GB of memory - I do not 
>>> believe that this is the problem. Also -mmodel=large
>>> does not appear to be a valid argument for gcc under x86_64:
>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>
>>> - Martin
>>>
>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>> Hi,
>>>>>
>>>>> I am running into problems when sending large messages (about
>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>
>>>>> # mpicc -g sendrecv.c
>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>> id=1: calling irecv ...
>>>>> id=0: calling isend ...
>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>
>>>>> This is with OpenMPI-1.3.3.
>>>>> Does anybody know a solution to this problem?
>>>>>
>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>> and never returns.
>>>>> I asked on the openmpi users list but got no response ...
>>>>>
>>>>> Cheers,
>>>>> Martin
>>>>>
>>>>> --
>>>>> Martin Siegert
>>>>> Head, Research Computing
>>>>> WestGrid Site Lead
>>>>> IT Services                                phone: 778 782-4691
>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>> Canada  V5A 1S6