[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Gus Correa gus at ldeo.columbia.edu
Mon Nov 16 22:26:52 PST 2009

Hi Martin

Answers/comments inline below

Martin Siegert wrote:
> Hi Gus,
> On Mon, Nov 16, 2009 at 10:40:51PM -0500, Gus Correa wrote:
>> Hi Martin
>> I tried your program with the four combinations of
>> IB and TCP/IP, mcmodel small and medium.
>> I lazily didn't recompile OpenMPI (1.3.2) with mcmodel=medium,
>> just the program, hence this is not a very clean test.
>> FYI, we have dual-socket quad-core AMD Opteron
>> nodes with 16GB RAM each.
>> OpenMPI 1.3.2, CentOS 5.2, gcc 4.1.2, OFED 1.4.
> We have dual-socket quad-core Intel E5430, 16GB,
> OpenMPI-1.3.3, SL 5.3, gcc 4.3.2 (and a bunch of other compilers,
> but gcc-4.3.2 is used to compile OpenMPI), OFED-1.3.2 (tested
> OFED-1.4.1 on two test nodes).
>> When I ran on 2 nodes and 16 processes the program would always fail
>> with segmentation fault / address not mapped on all four
>> combinations above.
>> However, when I ran on 2 nodes and 2 processes ( -bynode flag in
>> use to direct each process to a separate node) then it
>> worked over all four combinations!
>> Here is the IB+medium stderr (you printed to stderr):
>> id=1: calling irecv ...
>> id=0: calling isend ...
>> and the corresponding stdout:
>> ...
>> id=0: isend/irecv completed 1.954140
>> id=1: isend/irecv completed 4.192037
> Thanks!!
> Now I am surprised ... this always fails here.
> What's the difference?

The software stack is not the same, and neither is the hardware.
But I would guess they are not so far apart as to make the difference.

Have you tried to run on TCP/IP?
Say, using:

         -mca btl tcp,sm,self \

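and perhaps one of the following (OpenMPI treats them as mutually
exclusive, so use only one; eth[0,1] stands for either eth0 or eth1):
	-mca btl_tcp_if_exclude lo,eth[0,1]
	-mca btl_tcp_if_include eth[0,1]
to select the Ethernet port?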

I would guess you have at least one Ethernet network
to test the program over TCP/IP.
If it works on TCP/IP,
then the problem is likely to reside within IB.
(Maybe in OFED-1.3.2?)

>> This rules out a problem with memory model, I suppose.
>> Small is good enough for your message size,
>> as long as there is enough RAM for all processes,
>> MPI overhead, etc.
>> Also, as Don Holmgren already pointed out to you,
>> make sure your limits are properly set on the nodes.
>> For instance, we use Torque, and we put these settings
>> on the nodes' /etc/init.d/pbs_mom:
>> ulimit -n 32768
>> ulimit -s unlimited
>> ulimit -l unlimited
>> Just like Don, we've been burned by this before, when using the
>> vendor original setup.
>> Of course these limits can be set in other ways.
> I have been running this on the two test nodes without going through
> torque to avoid exactly this kind of problem.
> Anyway, I just ran the same program through torque, ran "ulimit -a"
> in the pbs script (all looks fine), but the program still fails.
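
The limits discussed above can also be checked from inside the job itself, which shows what the process actually inherited rather than what the login shell reports. Below is a minimal sketch in C using the standard getrlimit() call; the helper names (soft_limit_is_unlimited, report_limit, report_mpi_limits) are made up for illustration, and RLIMIT_MEMLOCK is assumed to be available, as it is on Linux.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Returns 1 if the soft limit for `resource` is unlimited, 0 if it is
 * finite, and -1 if getrlimit() fails. */
int soft_limit_is_unlimited(int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0)
        return -1;
    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}

/* Prints the soft limit for one resource.
 * Returns 0 on success, -1 if getrlimit() fails. */
int report_limit(FILE *out, const char *label, int resource)
{
    struct rlimit rl;
    assert(out != NULL && label != NULL);
    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        fprintf(out, "%s: unlimited\n", label);
    else
        fprintf(out, "%s: %llu\n", label, (unsigned long long)rl.rlim_cur);
    return 0;
}

/* Reports the three limits mentioned in the thread.
 * Returns 0 if all three could be read, nonzero otherwise. */
int report_mpi_limits(FILE *out)
{
    int rc = 0;
    rc |= report_limit(out, "open files (ulimit -n)", RLIMIT_NOFILE);
    rc |= report_limit(out, "stack size in bytes (ulimit -s)", RLIMIT_STACK);
    rc |= report_limit(out, "locked memory in bytes (ulimit -l)", RLIMIT_MEMLOCK);
    return rc;
}
```

Calling report_mpi_limits(stderr) near the start of the program, on every rank, would show whether each process really sees the expected values. Note that getrlimit() reports the stack and locked-memory limits in bytes, whereas bash's "ulimit -s" and "ulimit -l" print kilobytes.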
>> As a practical matter:
>> Would it be possible/desirable to reduce the message size,
>> splitting the huge message into several smaller ones?
>> I know the wisdom is that one big message is better
>> than many small ones, but here we're talking about huge,
>> not big, and sizable, not small.
>> Even your tiny test program takes a detectable time to run
>> (4+ seconds on IB, 14+ seconds on TCP/IP).
>> It may be worth writing another version of it looping over
>> smaller messages,
>> and do some timing tests to compare with the huge
>> message version.
>> There may be a sweet spot for the message size vs. number of
>> messages, I would guess.
>> Big may not always be better.
>> In the past a user here had a program sending very large messages
>> (big 3D arrays).
>> Not so big as to hit the 2GB threshold, but big enough to
>> slow down the nodes and the cluster.
>> Rewriting the program to loop over smaller messages
>> (2D array slices) solved the problem.
>> I remember other threads in the MPICH and OpenMPI
>> mailing lists that reported difficulties with huge messages.
>> My $0.02
>> Gus Correa
>> ---------------------------------------------------------------------
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> ---------------------------------------------------------------------
> In principle, yes ... I already wrote wrapper functions
> myMPI_Isend, myMPI_Irecv that do exactly that.
> However, we are talking about one of those quantum chemistry
> programs: many thousands of lines ... I'd really like to avoid
> this.
> - Martin
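
For readers following the thread, here is a rough sketch of what such a splitting wrapper might look like. It is not Martin's myMPI_Isend (that code was not posted). To keep it compilable without MPI, the actual send is abstracted into a hypothetical callback type, send_chunk_fn; in a real program the callback would wrap MPI_Send or MPI_Isend. All names here are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: sends `count` doubles starting at `buf`.
 * In an MPI program this would wrap a call such as MPI_Send or
 * MPI_Isend. It returns 0 on success. */
typedef int (*send_chunk_fn)(const double *buf, int count, void *ctx);

/* Sends `total` doubles in pieces of at most `max_chunk` elements.
 * Returns the number of chunks sent, or -1 if a callback call fails. */
long send_in_chunks(const double *buf, size_t total, int max_chunk,
                    send_chunk_fn send_chunk, void *ctx)
{
    assert(max_chunk > 0 && send_chunk != NULL);
    long nchunks = 0;
    size_t offset = 0;
    while (offset < total) {
        size_t remaining = total - offset;
        int count = remaining < (size_t)max_chunk ? (int)remaining : max_chunk;
        if (send_chunk(buf + offset, count, ctx) != 0)
            return -1;
        offset += (size_t)count;
        nchunks++;
    }
    return nchunks;
}

/* Stand-in sender for experiments without MPI: adds up the element
 * counts it receives in the size_t that `ctx` points to. */
int counting_sender(const double *buf, int count, void *ctx)
{
    (void)buf;
    *(size_t *)ctx += (size_t)count;
    return 0;
}
```

The receiving side would need to post receives with the same chunk sizes in the same order. MPI does not let messages between the same pair of processes with the same tag and communicator overtake one another, so the pieces arrive in sequence.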

A few days ago somebody posted a tip here on how to run VASP in
a more scalable/efficient way just by choosing some internal
code parameters (probably available through a mere namelist).
This was after a long discussion here on how to make VASP
more scalable by tweaking OpenMPI MCA parameters, etc.

Would your user be willing to take a look at the code documentation
and find out if there is a way to decompose his domain, or matrix,
or problem, or whatever, in a more sensible
(and hopefully scalable) way?
Oftentimes there is.
These programs are not necessarily poorly designed,
but users need to read the documentation (or articles about the method)
to find out how to use them right.
A knowledgeable user should understand what the mathematical
method and the algorithm are doing, or at least be willing
to learn the basics of them.

Unless the problem itself is huge, passing an array of 180 million
doubles doesn't sound reasonable, just a brute force approach,
particularly if only two processes are sharing the work,
if you don't mind my saying so.
And if the problem is huge, one could argue that more nodes/processes
and smaller messages could be used to get the job done better.
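
To put numbers on the message size: 180 million doubles at 8 bytes each is 1,440,000,000 bytes, roughly 1.34 GiB, which is below 2^31 = 2,147,483,648 bytes. The small C sketch below does that arithmetic and counts how many messages a given chunk size would require. The function names are made up for illustration, and the chunk size of one million elements in the usage note is an arbitrary example, not a recommendation.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a message of `count` doubles. */
uint64_t message_bytes(uint64_t count)
{
    return count * (uint64_t)sizeof(double);
}

/* Number of messages needed to move `count` elements when each
 * message carries at most `chunk` elements (rounds up). */
uint64_t chunks_needed(uint64_t count, uint64_t chunk)
{
    assert(chunk > 0);
    return (count + chunk - 1) / chunk;
}
```

For instance, chunks_needed(180000000, 1000000) gives 180 messages of 8,000,000 bytes each. Timing a few such chunk sizes against the single huge message would show whether there is a sweet spot.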

We're mostly a climate, atmosphere, ocean shop, but that doesn't
mean we are immune to this type of problem.

Just a suggestion.
Gus Correa
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

>> Martin Siegert wrote:
>>> Hi,
>>> On Mon, Nov 16, 2009 at 04:55:51PM -0500, Gus Correa wrote:
>>>> Hi Martin
>>>> We didn't know which compiler you used.
>>>> So what Michael sent you ("mmodel=memory_model")
>>>> is the Intel compiler flag syntax.
>>>> (PGI uses the same syntax, IIRR.)
>>> Now that was really stupid, I am using gcc-4.3.2 and even looked up
>>> the correct syntax for the memory model, but nevertheless pasted the
>>> Intel syntax into my configure script ... sorry.
>>>> Gcc/gfortran use "-mcmodel=memory_model" for x86_64 architecture.
>>>> I only used this with Intel ifort, hence I am not sure,
>>>> but "medium" should work fine for large data/not-so-large program
>>>> in gcc/gfortran.
>>>> The "large" model doesn't seem to be implemented by gcc (4.1.2)
>>>> anyway.
>>>> (Maybe it is there in newer gcc versions.)
>>>> The darn thing is that gcc says "medium" doesn't support building
>>>> shared libraries,
>>>> hence you may need to build OpenMPI static libraries instead,
>>>> I would guess.
>>>> (Again, check this if you have a newer gcc version.)
>>>> Here's an excerpt of my gcc (4.1.2) man page:
>>>>        -mcmodel=small
>>>>            Generate code for the small code model: the program and its
>>>>            symbols must be linked in the lower 2 GB of the address space.
>>>>            Pointers are 64 bits.  Programs can be statically or
>>>>            dynamically linked.  This is the default code model.
>>>>        -mcmodel=kernel
>>>>            Generate code for the kernel code model.  The kernel runs in
>>>>            the negative 2 GB of the address space.  This model has to be
>>>>            used for Linux kernel code.
>>>>        -mcmodel=medium
>>>>            Generate code for the medium model: The program is linked in
>>>>            the lower 2 GB of the address space but symbols can be located
>>>>            anywhere in the address space.  Programs can be statically or
>>>>            dynamically linked, but building of shared libraries are not
>>>>            supported with the medium model.
>>>>        -mcmodel=large
>>>>            Generate code for the large model: This model makes no
>>>>            assumptions about addresses and sizes of sections.  Currently
>>>>            GCC does not implement this model.
>>> I recompiled openmpi with -mcmodel=medium and -mcmodel=large. The program
>>> still fails. The error message changes, however:
>>> id=1: calling irecv ...
>>> id=0: calling isend ...
>>> mlx4: local QP operation err (QPN 340052, WQE index 0, vendor syndrome 70, opcode = 5e)
>>> [[55365,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL QP OPERATION ERROR status number 2 for wr_id 282498416 opcode 11046  vendor error 112 qp_idx 3
>>> (strerror(112) is "Host is down", which is certainly not correct).
>>> This now points to system libraries - libmlx4. Am I correct in assuming that
>>> this is either an OFED problem or OpenMPI exceeding some buffers in OFED
>>> libraries without checking?
>>>> If you are using OpenMPI, "ompi-info -config"
>>>> will tell the flags used to compile it.
>>>> Mine is 1.3.2 and has no explicit mcmodel flag,
>>>> which according to the gcc man page should default to "small".
>>> Are you - in fact, is anybody - able to run my test program? I am
>>> hoping that there is some stupid misconfiguration on the cluster
>>> that can be fixed easily, without reinstalling/recompiling all
>>> apps ...
>>>> Are you using 16GB per process or for the whole set of processes?
>>> I am running the two processes on different nodes (and nothing else
>>> on the nodes), thus each process has the full 16GB available.
>>>> I hope this helps,
>>>> Gus Correa
>>>> ---------------------------------------------------------------------
>>>> Gustavo Correa
>>>> Lamont-Doherty Earth Observatory - Columbia University
>>>> Palisades, NY, 10964-8000 - USA
>>>> ---------------------------------------------------------------------
>>> Thanks!
>>> - Martin
>>>> Martin Siegert wrote:
>>>>> Hi Michael,
>>>>> On Mon, Nov 16, 2009 at 10:49:23AM -0700, Michael H. Frese wrote:
>>>>>> Martin,
>>>>>> Could it be that your MPI library was compiled using a small memory 
>>>>>> model?  The 180 million doubles sounds suspiciously close to a 2 GB 
>>>>>> addressing limit.
>>>>>> This issue came up on the list recently under the topic "Fortran Array 
>>>>>> size question."
>>>>>> Mike
>>>>> I am running MPI applications that use more than 16GB of memory - I do 
>>>>> not believe that this is the problem. Also -mmodel=large
>>>>> does not appear to be a valid argument for gcc under x86_64:
>>>>> gcc -DNDEBUG -g -fPIC -mmodel=large   conftest.c  >&5
>>>>> cc1: error: unrecognized command line option "-mmodel=large"
>>>>> - Martin
>>>>>> At 05:43 PM 11/14/2009, Martin Siegert wrote:
>>>>>>> Hi,
>>>>>>> I am running into problems when sending large messages (about
>>>>>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>>>>>> # mpicc -g sendrecv.c
>>>>>>> # mpiexec -machinefile m2 -n 2 ./a.out
>>>>>>> id=1: calling irecv ...
>>>>>>> id=0: calling isend ...
>>>>>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 
>>>>>>> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for 
>>>>>>> wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>>>>>> This is with OpenMPI-1.3.3.
>>>>>>> Does anybody know a solution to this problem?
>>>>>>> If I use MPI_Allreduce instead of MPI_Isend/Irecv, the program just hangs
>>>>>>> and never returns.
>>>>>>> I asked on the openmpi users list but got no response ...
>>>>>>> Cheers,
>>>>>>> Martin
>>>>>>> --
>>>>>>> Martin Siegert
>>>>>>> Head, Research Computing
>>>>>>> WestGrid Site Lead
>>>>>>> IT Services                                phone: 778 782-4691
>>>>>>> Simon Fraser University                    fax:   778 782-4242
>>>>>>> Burnaby, British Columbia                  email: siegert at sfu.ca
>>>>>>> Canada  V5A 1S6
