[Beowulf] IB troubles - mca_mpool_openib_register
Bill Wichser
bill at Princeton.EDU
Thu Jun 22 10:37:55 PDT 2006
Thanks.
No, I have not tried a different MPI to test with, but I will do so.
As for a later version of OpenIB, there is incentive to do so, but I
don't know how quickly that can be accomplished.
Bill
Lombard, David N wrote:
> More memory in your nodes? Not sure what size of queues and such
> openmpi allocates, but you could simply be running out of memory if
> openmpi allocates large queue depths.
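>
> (One quick check, if it is queue depths: "ompi_info --param btl openib"
> should dump the openib BTL's tunables so you can compare queue and
> buffer sizes against available memory -- exact parameter names vary by
> Open MPI release, so treat any specific name as a guess.)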
>
> Have you tried an alternate MPI to see if you have the same problem?
> Intel MPI, MVAPICH, and MVAPICH2, as well as others, support OpenIB.
>
> Can you consider moving to a newer version of OpenIB?
>
> --
> David N. Lombard
>
> My statements represent my opinions, not those of Intel Corporation
>
>
>>-----Original Message-----
>>From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
>>Behalf Of Bill Wichser
>>Sent: Thursday, June 22, 2006 6:02 AM
>>To: beowulf at beowulf.org
>>Subject: [Beowulf] IB troubles - mca_mpool_openib_register
>>
>>
>>Cluster with dual Xeons and Topspin IB adapters running a RH
>>2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each node
>>w/8G of memory.
>>
>>Updated the firmware in the IB cards as per Mellanox.
>>
>>Updated /etc/security/limits.conf to set memlock to 8192, both soft and
>>hard limits, to overcome the initial trouble of pool allocation.
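>>
>>(For reference, the limits.conf entries in question look something
>>like this -- values in KB, and "*" is just a stand-in for whichever
>>users or groups actually run the MPI jobs:
>>
>>    *    soft    memlock    8192
>>    *    hard    memlock    8192
>>
>>after which "ulimit -l" inside a job should report 8192.)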
>>
>>Application is cpi.c.
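>>
>>(For anyone not familiar with it, cpi.c is the stock pi-calculation
>>example that ships with MPICH; a stripped-down equivalent looks
>>roughly like this, so nothing about the application itself should be
>>stressing memory:
>>
>>    #include <mpi.h>
>>    #include <stdio.h>
>>
>>    /* Roughly what cpi.c does: integrate 4/(1+x^2) over [0,1] by the
>>     * midpoint rule, with the interval split across the ranks, then
>>     * reduce the partial sums to rank 0. */
>>    int main(int argc, char *argv[])
>>    {
>>        int rank, size, i, n = 10000;
>>        double h, x, sum = 0.0, mypi, pi;
>>
>>        MPI_Init(&argc, &argv);
>>        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>        MPI_Comm_size(MPI_COMM_WORLD, &size);
>>
>>        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>        h = 1.0 / (double)n;
>>        for (i = rank + 1; i <= n; i += size) {
>>            x = h * ((double)i - 0.5);
>>            sum += 4.0 / (1.0 + x * x);
>>        }
>>        mypi = h * sum;
>>
>>        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>>        if (rank == 0)
>>            printf("pi is approximately %.16f\n", pi);
>>        MPI_Finalize();
>>        return 0;
>>    }
>>
>>The failure below shows up even with a test code this trivial.)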
>>
>>I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
>>except for the
>>
>>[btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
>>ibv_create_qp: returned 0 byte(s) for max inline data
>>
>>error messages, to be fixed I suppose in the next release. These I can
>>live with, perhaps, for now.
>>
>>The problem is that when I run with nodes=64:ppn=2 but only use -np 64
>>with my openmpi (v1.0.2, gcc-compiled), it still runs fine, but when I
>>run with -np 65 I get megabytes of error messages and the job never
>>completes. The errors all look like this:
>>
>>mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
>>failed with error: Cannot allocate memory
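>>
>>(To take Open MPI out of the picture, a minimal standalone registration
>>test against the verbs library directly -- device choice and buffer
>>size are arbitrary here, this is just a sketch -- would be something
>>like:
>>
>>    #include <stdio.h>
>>    #include <stdlib.h>
>>    #include <string.h>
>>    #include <errno.h>
>>    #include <infiniband/verbs.h>
>>
>>    int main(void)
>>    {
>>        struct ibv_device **devs;
>>        struct ibv_context *ctx;
>>        struct ibv_pd *pd;
>>        struct ibv_mr *mr;
>>        void *buf;
>>        size_t len = 1060864;   /* same size as in the error message */
>>        int n;
>>
>>        /* open the first HCA and get a protection domain */
>>        devs = ibv_get_device_list(&n);
>>        if (!devs || n == 0) { fprintf(stderr, "no IB devices\n"); return 1; }
>>        ctx = ibv_open_device(devs[0]);
>>        pd  = ctx ? ibv_alloc_pd(ctx) : NULL;
>>        if (!pd) { fprintf(stderr, "device open / PD alloc failed\n"); return 1; }
>>
>>        /* register one ~1 MB buffer, the same call OpenMPI's mpool makes */
>>        buf = malloc(len);
>>        mr  = ibv_reg_mr(pd, buf, len,
>>                         IBV_ACCESS_LOCAL_WRITE |
>>                         IBV_ACCESS_REMOTE_READ |
>>                         IBV_ACCESS_REMOTE_WRITE);
>>        if (!mr)
>>            /* ENOMEM here usually means the locked-memory (memlock)
>>             * limit, or the HCA is out of registration resources */
>>            fprintf(stderr, "ibv_reg_mr failed: %s\n", strerror(errno));
>>        else
>>            ibv_dereg_mr(mr);
>>
>>        free(buf);
>>        ibv_dealloc_pd(pd);
>>        ibv_close_device(ctx);
>>        ibv_free_device_list(devs);
>>        return mr ? 0 : 1;
>>    }
>>
>>built with something like "gcc -o reg_test reg_test.c -libverbs"; if
>>this also fails with "Cannot allocate memory", the problem is below
>>Open MPI.)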
>>
>>I've submitted this to the openib-general mailing list with no responses. I'm
>>not sure if this is an openmpi problem, an openib problem, or some
>>configuration problem with the IB fabric. Other programs fail with even
>>fewer processors allocated, with these same errors. Running over
>>TCP, albeit across the GigE network and not over IB, works fine.
>>
>>I'm stuck here, not knowing how to proceed. Has anyone run into this issue
>>and, more importantly, found a solution? I don't believe it to be a
>>limits.conf issue, as I can use both processors per node on up to 32
>>nodes (-np 64) without problems.
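>>
>>(One sanity check I may still try, in case the launcher environment
>>differs from an interactive shell: start a plain shell command under
>>the same mpirun to see what memlock limit each rank really inherits,
>>e.g. something like
>>
>>    mpirun -np 65 -hostfile $PBS_NODEFILE /bin/sh -c 'ulimit -l'
>>
>>with the exact hostfile option adjusted for this mpirun version.)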
>>
>>Thanks,
>>Bill
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf