[Beowulf] MPI_Isend/Irecv failure for IB and large message sizes

Martin Siegert siegert at sfu.ca
Mon Nov 16 13:24:50 PST 2009


Hi Mark,

On Sun, Nov 15, 2009 at 03:38:08PM -0500, Mark Hahn wrote:
>> I am running into problems when sending large messages (about
>> 180000000 doubles) over IB. A fairly trivial example program is attached.
>
> sorry if you've already thought of this, but might you have RLIMIT_MEMLOCK
> set too low?  (ulimit -l)

Good point.
By now I have played with all kinds of ulimits (the nodes have 16GB
of memory and 16GB of swap space - this program is not even coming close
to those limits). This is the current setting:
# ulimit -a
core file size          (blocks, -c) 0                            
data seg size           (kbytes, -d) unlimited                    
scheduling priority             (-e) 0                            
file size               (blocks, -f) unlimited                    
pending signals                 (-i) 139264
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 139264
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

... same error :-(
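
The limits printed above are those of the interactive shell. Processes started through a launcher daemon or a batch system do not necessarily inherit them, so it may be worth querying the limit from inside the program itself. A minimal sketch, assuming a Linux system where RLIMIT_MEMLOCK is available; the helper names are hypothetical and not part of the attached example program:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: read the soft and hard RLIMIT_MEMLOCK values of the
 * calling process. Returns 0 on success, -1 if getrlimit() fails. */
int get_memlock_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}

/* Hypothetical helper: print one limit value, spelling out RLIM_INFINITY. */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu bytes\n", label, (unsigned long long)value);
}

/* Hypothetical helper: report both limits for the calling process.
 * Returns 0 on success, -1 if the limits could not be read. */
int report_memlock_limit(void)
{
    rlim_t soft, hard;

    if (get_memlock_limit(&soft, &hard) != 0) {
        perror("getrlimit(RLIMIT_MEMLOCK)");
        return -1;
    }
    print_limit("memlock soft", soft);
    print_limit("memlock hard", hard);
    return 0;
}
```

Calling report_memlock_limit() on every rank right after MPI_Init would show what the MPI processes actually see. Note that getrlimit() reports bytes, whereas "ulimit -l" reports kbytes.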

>> [[60322,1],1][btl_openib_component.c:2951:handle_wc] from b1 to: b2 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 199132400 opcode 549755813  vendor error 105 qp_idx 3
>
> 105 looks like it might be an errno to me:
> #define ENOBUFS         105     /* No buffer space available */
>
> regards, mark.

BTW: when using Intel MPI (MPICH2-based) the program segfaults at
l = 268435456 = 2^31/8, which makes me suspect that they use MPI_BYTE to
transfer the data internally and multiply the count by 8
without checking whether the integer overflows ...
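
The arithmetic behind that suspicion can be sketched as follows. This assumes a 32-bit int and 8-byte doubles; the helper names are hypothetical and the code only illustrates the suspected overflow, it says nothing about what Intel MPI actually does internally:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if count elements of elem_size bytes each
 * still add up to a byte count that fits into a signed int, 0 otherwise.
 * The product is formed in 64 bits so that the check itself cannot wrap
 * for ordinary element sizes. */
int byte_count_fits_int(int count, size_t elem_size)
{
    assert(count >= 0 && elem_size > 0);
    return (uint64_t)count * (uint64_t)elem_size <= (uint64_t)INT_MAX;
}

/* Hypothetical helper: largest element count whose size in bytes can still
 * be stored in a signed int. */
int max_count_for_int_bytes(size_t elem_size)
{
    assert(elem_size > 0);
    return (int)((uint64_t)INT_MAX / (uint64_t)elem_size);
}
```

With these assumptions max_count_for_int_bytes(sizeof(double)) is 268435455, so 268435456 doubles is the first count whose byte size (2^31) no longer fits. The 180000000 doubles from the IB test are 1440000000 bytes and stay below that threshold.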

- Martin


