[Beowulf] IB troubles - mca_mpool_openib_register

Bill Wichser bill at Princeton.EDU
Thu Jun 22 06:01:50 PDT 2006


Cluster with dual Xeons and Topspin IB adapters running a RH 
2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each node 
with 8 GB of memory.

Updated the firmware in the IB cards as per Mellanox.

Updated /etc/security/limits.conf to set memlock to 8192, for both the 
soft and hard limits, to overcome the initial trouble with pool allocation.
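
A limits.conf change only takes effect if the process that launches the
MPI ranks actually inherits it. Below is a minimal sketch for checking
what a process really sees. The helper name memlock_limits_kb is made up
for illustration. It assumes limits.conf expresses memlock in KB, while
getrlimit reports bytes.

```python
import resource


def memlock_limits_kb():
    """Return (soft, hard) RLIMIT_MEMLOCK in KB; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_kb(value):
        # getrlimit reports bytes; limits.conf is assumed to use KB
        return None if value == resource.RLIM_INFINITY else value // 1024

    return to_kb(soft), to_kb(hard)


if __name__ == "__main__":
    for name, value in zip(("soft", "hard"), memlock_limits_kb()):
        print("memlock", name, "unlimited" if value is None else f"{value} KB")
```

Running this through the same batch system and launcher as the MPI job,
rather than from an interactive login, shows whether the 8192 setting
reaches the compute-node processes.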

Application is cpi.c.

I can run across the 64 nodes using nodes=64:ppn=1 without trouble, 
except for the

[btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
ibv_create_qp: returned 0 byte(s) for max inline data

error messages, to be fixed I suppose in the next release.  These I can 
live with, perhaps, for now.

The problem is that when I run with nodes=64:ppn=2 and only use -np 64 
with my openmpi (v 1.0.2 gcc compiled), it still runs fine, but when I 
run with -np 65 I get megabytes of error messages and the job never 
completes.  The errors all look like this:

mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
failed with error: Cannot allocate memory
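
For scale, here is a rough arithmetic sketch comparing the size of the
failing registration with the memlock setting mentioned above. It assumes
the limits.conf value is in KB, that registered (pinned) memory counts
against the per-process memlock limit, and a 4096-byte page size. The
function name regions_that_fit is hypothetical. This is not a diagnosis,
only a sanity check on the numbers.

```python
PAGE_SIZE = 4096        # assumed page size in bytes
MEMLOCK_KB = 8192       # memlock value set in limits.conf (assumed KB)
REG_BYTES = 1060864     # length passed to ibv_reg_mr in the error message


def regions_that_fit(limit_kb, region_bytes):
    """How many regions of region_bytes fit under a limit given in KB."""
    return (limit_kb * 1024) // region_bytes


if __name__ == "__main__":
    # The failing region is a whole number of pages under the assumed size.
    print("pages per region:", REG_BYTES // PAGE_SIZE)
    print("regions under limit:", regions_that_fit(MEMLOCK_KB, REG_BYTES))
```

Under these assumptions only a handful of regions of this size fit under
the limit per process, so it may be worth checking how many such
registrations each rank makes as the process count grows.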

I've posted this to the openib-general mailing list with no responses. I'm 
not sure if this is an openmpi problem, an openib problem, or some 
configuration problem with the IB fabric.  Other programs fail with these 
same errors with even fewer processors allocated.  Running over 
TCP, albeit across the GigE network and not over IB, works fine.

I'm stuck here, not knowing how to proceed.  Has anyone run into this issue 
and, more importantly, found a solution?  I don't believe it to be a 
limits.conf issue, as I can allocate both processors on each node for up 
to 32 nodes (-np 64) without problems.

Thanks,
Bill


