[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Don Holmgren djholm at fnal.gov
Mon Nov 16 14:24:27 PST 2009


Be careful - ulimits can differ between an interactive shell launched with
rsh/ssh, an interactive batch shell launched with "qsub -I" and the like, the 
environment of your batch script, and the environment of the processes launched
via mpirun.  I've been burned by this before.

If you are using a TM-based launch (Open MPI or OSU mpiexec, for example), the
ulimit environment on a PBS/Torque batch setup will be governed by the ulimits
of pbs_mom, which in turn are governed by your init process and/or by any
ulimit commands in init.d/pbs-client.

The only way to be sure of a particular ulimit is to do a getrlimit() call in
your MPI-launched binary and check the value.
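
Something along these lines is what I mean; treat it as a rough sketch, not
tested on your setup. The names report_memlock() and fmt_limit() are made up
for illustration, and I am assuming Linux, where RLIMIT_MEMLOCK is available
from <sys/resource.h>. Note that getrlimit() reports bytes, while "ulimit -l"
prints kbytes.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Render one limit value; RLIM_INFINITY is what "ulimit -l" shows
 * as "unlimited". Other values are converted from bytes to kB. */
static const char *fmt_limit(rlim_t v, char *buf, size_t len)
{
    if (v == RLIM_INFINITY)
        snprintf(buf, len, "unlimited");
    else
        snprintf(buf, len, "%llu kB", (unsigned long long)(v / 1024));
    return buf;
}

/* Print the locked-memory limit this process actually runs under.
 * "rank" is only a label for the output line (hypothetical usage:
 * pass the value obtained from MPI_Comm_rank()).
 * Returns 0 on success, -1 if getrlimit() fails. */
int report_memlock(int rank)
{
    struct rlimit rl;
    char soft[32], hard[32];

    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    printf("rank %d: RLIMIT_MEMLOCK soft=%s hard=%s\n", rank,
           fmt_limit(rl.rlim_cur, soft, sizeof soft),
           fmt_limit(rl.rlim_max, hard, sizeof hard));
    return 0;
}
```

Call report_memlock(rank) right after MPI_Init() and MPI_Comm_rank() on every
rank, then compare the output with what "ulimit -l" says in your shell.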

Chances are this isn't your problem, though, because usually the error messages
make it pretty clear that a memory lock failure has occurred.

Don Holmgren
Fermilab




On Mon, 16 Nov 2009, Martin Siegert wrote:

> Hi Mark,
>
> On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>>> I am running into problems when sending large messages (about
>>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>>
>> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
>> set too low?  (ulimit -l)
>
> Good point.
> By now I have played with all kinds of ulimits (the nodes have 16GB
> of memory and 16GB of swap space - this program is not even coming close
> to those limits). This is the current setting:
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 139264
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) unlimited
> real-time priority              (-r) 0
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 139264
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> ... same error :-(
>
>>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>>
>> 105 looks like it might be an errno to me:
>> #define ENOBUFS         105     /* No buffer space available */
>>
>> regards, mark.
>
> BTW: when using Intel-MPI (MPICH2) the program segfaults with
> l = 268435456 = 2^31/8 which makes me suspect that they use MPI_BYTE to
> transfer the data internally and multiply the variable count by 8
> without checking whether the integer overflows ...
>
> - Martin
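
On the overflow guess quoted above: the arithmetic can be checked with a few
lines of C. This only illustrates the suspected failure mode (a byte count
that no longer fits in an int); it says nothing about what Intel MPI actually
does internally. The function name byte_count_fits_int() is made up, and a
32-bit int is assumed.

```c
#include <limits.h>
#include <stddef.h>

/* Returns 1 if "count" elements of "elem_size" bytes each can be
 * expressed as an int byte count without overflow, 0 otherwise.
 * Negative counts and a zero element size are rejected (return 0).
 * The check divides instead of multiplying, so it cannot overflow. */
int byte_count_fits_int(int count, size_t elem_size)
{
    if (count < 0 || elem_size == 0)
        return 0;
    return (unsigned long long)count
           <= (unsigned long long)INT_MAX / elem_size;
}
```

With 8-byte doubles, 268435455 elements still fit (2147483640 bytes), while
268435456 elements give 2147483648 bytes = 2^31, one more than INT_MAX.
The 180000000 doubles from the original report come to 1440000000 bytes, which
is below that limit.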


