[Beowulf] IB troubles - mca_mpool_openib_register

Bill Wichser bill at Princeton.EDU
Thu Jun 22 10:37:55 PDT 2006


Thanks.

No, I have not tried a different version of MPI yet, but will do so. 
As for a later version of OpenIB, there is incentive to do so, but I 
don't know how quickly that can be accomplished.

Bill

Lombard, David N wrote:
> More memory in your nodes?  I'm not sure what queue sizes and depths
> openmpi allocates, but if it allocates large queue depths you could
> simply be running out of memory.
> 
> Have you tried an alternate MPI to see if you have the same problem?
> Intel MPI, MVAPICH, MVAPICH2, as well as others support OpenIB.
> 
> Can you consider moving to a newer version of OpenIB?
> 
> --
> David N. Lombard
> 
> My statements represent my opinions, not those of Intel Corporation 
> 
> 
>>-----Original Message-----
>>From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at
>>beowulf.org] On Behalf Of Bill Wichser
>>Sent: Thursday, June 22, 2006 6:02 AM
>>To: beowulf at beowulf.org
>>Subject: [Beowulf] IB troubles - mca_mpool_openib_register
>>
>>
>>Cluster with dual Xeons and Topspin IB adapters running a RH
>>2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each
>>node w/8G of memory.
>>
>>Updated firmware as per Mellanox in the IB cards.
>>
>>Updated /etc/security/limits.conf to set memlock to 8192, both soft
>>and hard limits, to overcome the initial trouble of pool allocation.
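For reference, the memlock change described above would look something like this in /etc/security/limits.conf (values are in KB, so 8192 is 8 MB of lockable memory per process; raising it much higher, or to "unlimited", is a common recommendation for IB):

```text
# /etc/security/limits.conf -- memlock values are in KB
*    soft    memlock    8192
*    hard    memlock    8192
```

Note that limits.conf is applied by PAM at login, so processes launched by a daemon (e.g. a batch system's node daemon) started before the change may not inherit the new limits.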
>>
>>Application is cpi.c.
>>
>>I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
>>except for the
>>
>>[btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
>>ibv_create_qp: returned 0 byte(s) for max inline data
>>
>>error messages, to be fixed I suppose in the next release.  These I
>>can live with, perhaps, for now.
>>
>>The problem is that when I run with nodes=64:ppn=2 and only use -np 64
>>with my openmpi (v 1.0.2 gcc compiled), it still runs fine, but when I
>>run with -np 65 I get megabytes of error messages and the job never
>>completes.  The errors all look like this:
>>
>>mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
>>failed with error: Cannot allocate memory
>>
>>I've submitted to the openib-general mailing list with no responses.
>>I'm not sure if this is an openmpi problem, an openib problem, or some
>>configuration problem with the IB fabric.  Other programs fail with
>>even fewer processors being allocated with these same errors.  Running
>>over TCP, albeit across the GigE network and not over IB, works fine.
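(For what it's worth, Open MPI lets you force the transport per run via MCA parameters, which makes the TCP-vs-IB comparison explicit; the exact invocation below is a sketch for the 1.0.x series, not taken from the original post:)

```shell
# run over TCP only (the case reported to work):
mpirun --mca btl tcp,self -np 65 ./cpi

# run over InfiniBand only, so any openib failure is unambiguous:
mpirun --mca btl openib,self -np 65 ./cpi
```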
>>
>>I'm stuck here not knowing how to proceed.  Has anyone found this
>>issue and, more importantly, found a solution?  I don't believe it to
>>be a
>>limits.conf issue as I can allocate both processors on a node up to 32
>>nodes (-np 64) without problems.
>>
>>Thanks,
>>Bill
>>_______________________________________________
>>Beowulf mailing list, Beowulf at beowulf.org
>>To change your subscription (digest mode or unsubscribe) visit
>>http://www.beowulf.org/mailman/listinfo/beowulf


