[Beowulf] IB troubles - mca_mpool_openib_register

Michael Huntingdon hunting at ix.netcom.com
Thu Jun 22 11:08:19 PDT 2006


If you are going to look into a different MPI implementation, 
consider HP-MPI. Support for the interconnects (GigE, Myrinet, IB, and 
Quadrics) is built into it, so you can create a single 
(common) operating environment for your programmers. I had a look at 
the benchmarks a few months ago, and the results appeared pretty 
consistent across the board.


At 10:37 AM 6/22/2006, Bill Wichser wrote:
>No I have not tried a different version of MPI to test but will do 
>so. As for a later version of OpenIB, there is incentive to do so 
>but I don't know how quickly that can be accomplished.
>Lombard, David N wrote:
>>More memory in your nodes?  Not sure what size of queues and such
>>openmpi allocates, but you could simply be running out of memory if
>>openmpi allocates large queue depths.
>>Have you tried an alternate MPI to see if you have the same problem?
>>Intel MPI, MVAPICH, MVAPICH2, as well as others support OpenIB.
>>Can you consider moving to a newer version of OpenIB?
>>David N. Lombard
>>My statements represent my opinions, not those of Intel Corporation
>>>-----Original Message-----
>>>From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
>>>On Behalf Of Bill Wichser
>>>Sent: Thursday, June 22, 2006 6:02 AM
>>>To: beowulf at beowulf.org
>>>Subject: [Beowulf] IB troubles - mca_mpool_openib_register
>>>Cluster with dual Xeons and Topspin IB adapters running a RH
>>>2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each
>>>w/8G of memory.
>>>Updated firmware as per Mellanox in the IB cards.
>>>Updated /etc/security/limits.conf to have memlock be 8192, both soft
>>>and hard limits, to overcome the initial trouble of pool allocation.
>>>Application is cpi.c.
>>>I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
>>>except for the
>>>_qp: returned 0 byte(s) for max inline data
>>>error messages, to be fixed I suppose in the next release.  These I
>>>live with, perhaps, for now.
>>>The problem is that when I run with nodes=64:ppn=2 and only use -np 64
>>>with my openmpi (v 1.0.2 gcc compiled), it still runs fine, but when I
>>>run with -np 65 I get megabytes of error messages and the job never
>>>completes.  The errors all look like this:
>>>mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
>>>failed with error: Cannot allocate memory
>>>I've submitted to the openib-general mailing list with no responses.
>>>I'm not sure if this is an openmpi problem, an openib problem, or some
>>>configuration problem with the IB fabric.  Other programs fail with
>>>fewer processors being allocated with these same errors.  Running over
>>>TCP, albeit across the GigE network and not over IB, works fine.
>>>I'm stuck here not knowing how to proceed.  Has anyone found this
>>>and, more importantly, found a solution?  I don't believe it to be a
>>>limits.conf issue as I can allocate both processors on a node up to 32
>>>nodes (-np 64) without problems.
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit
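
One thing that may be worth ruling out in the setup described above is the
locked-memory limit that the MPI processes actually see. Memory registered
with ibv_reg_mr is pinned and counts against RLIMIT_MEMLOCK, and
limits.conf expresses memlock in KB, so a value of 8192 is only 8 MB per
process. The sketch below is a rough check, not a diagnosis: it prints the
limit in effect for the current process and estimates how many buffers of
the size in the error message (1060864 bytes) would fit. The helper names
are hypothetical, and whether this limit is the real cause of the failure
is an assumption that the thread does not confirm.

```python
import resource

# Size of one failed registration, copied from the ibv_reg_mr error in the thread.
REG_BYTES = 1060864
# memlock value quoted in the thread; limits.conf expresses memlock in KB.
CONFIGURED_MEMLOCK_KB = 8192


def registrations_that_fit(limit_bytes, reg_bytes=REG_BYTES):
    """Return how many reg_bytes-sized buffers fit under limit_bytes.

    Returns None when the limit is unlimited.
    """
    if limit_bytes == resource.RLIM_INFINITY:
        return None
    return limit_bytes // reg_bytes


def describe(limit_bytes):
    """Format a limit value for printing."""
    if limit_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit_bytes} bytes ({limit_bytes // 1024} KB)"


if __name__ == "__main__":
    # Limits as seen by this process, which may differ from limits.conf.
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    print("soft memlock:", describe(soft))
    print("hard memlock:", describe(hard))
    print("buffers that fit under the soft limit:",
          registrations_that_fit(soft))
    print("buffers that fit under 8192 KB:",
          registrations_that_fit(CONFIGURED_MEMLOCK_KB * 1024))
```

Running this inside a batch job rather than from a login shell would show
the limit the job inherits. Processes started by a batch-system daemon do
not necessarily pick up the values set in limits.conf, so the two can
differ.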
