[Beowulf] IB troubles - mca_mpool_openib_register
Michael Huntingdon
hunting at ix.netcom.com
Thu Jun 22 11:08:19 PDT 2006
Bill,
If you are going to look into a different MPI implementation,
consider HP-MPI. Support for each interconnect (GigE, Myrinet, IB,
and Quadrics) is built in, so you can create a single (common)
operating environment for your programmers. I had a look at the
benchmarks a few months ago, and they appeared pretty consistent
across the board.
Michael
At 10:37 AM 6/22/2006, Bill Wichser wrote:
>Thanks.
>
>No, I have not tried a different MPI implementation to test, but will
>do so. As for a later version of OpenIB, there is incentive to do so,
>but I don't know how quickly that can be accomplished.
>
>Bill
>
>Lombard, David N wrote:
>>More memory in your nodes? Not sure what size of queues and such
>>openmpi allocates, but you could simply be running out of memory if
>>openmpi allocates large queue depths.
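>>
>>If large queue depths are the culprit, the openib BTL tunables can be
>>listed with:
>>
>>   ompi_info --param btl openib
>>
>>and overridden on the mpirun line with --mca. The queue-depth
>>parameter names differ between Open MPI versions, so go by what
>>ompi_info reports for your 1.0.2 build, along the lines of:
>>
>>   mpirun --mca btl openib,self --mca <queue-depth param> <value> ...
>>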
>>Have you tried an alternate MPI to see if you have the same problem?
>>Intel MPI, MVAPICH, and MVAPICH2, among others, support OpenIB.
>>Can you consider moving to a newer version of OpenIB?
>>--
>>David N. Lombard
>>My statements represent my opinions, not those of Intel Corporation
>>
>>>-----Original Message-----
>>>From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On
>>>Behalf Of Bill Wichser
>>>Sent: Thursday, June 22, 2006 6:02 AM
>>>To: beowulf at beowulf.org
>>>Subject: [Beowulf] IB troubles - mca_mpool_openib_register
>>>
>>>
>>>Cluster with dual Xeons and Topspin IB adapters running a RH
>>>2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each node
>>>w/8G of memory.
>>>
>>>Updated firmware as per Mellanox in the IB cards.
>>>
>>>Updated /etc/security/limits.conf to set memlock to 8192, both soft
>>>and hard limits, to overcome the initial trouble of pool allocation.
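>>>
>>>For reference, the limits.conf entries look like this (memlock is
>>>measured in kilobytes, so 8192 is 8 MB of lockable memory; many IB
>>>sites simply use "unlimited"):
>>>
>>>   *    soft    memlock    8192
>>>   *    hard    memlock    8192
>>>
>>>and the limit a shell actually got can be checked with ulimit -l.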
>>>
>>>Application is cpi.c.
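>>>
>>>This is presumably the stock MPICH example that computes pi by
>>>midpoint-rule integration; a minimal equivalent sketch, for anyone
>>>wanting to reproduce the test:
>>>
>>>   #include <mpi.h>
>>>   #include <stdio.h>
>>>
>>>   int main(int argc, char *argv[])
>>>   {
>>>       int rank, size, i, n = 10000;
>>>       double h, x, sum = 0.0, mypi, pi;
>>>
>>>       MPI_Init(&argc, &argv);
>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>       MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>       /* integrate 4/(1+x^2) over [0,1]; each rank takes every
>>>          size-th interval */
>>>       h = 1.0 / (double)n;
>>>       for (i = rank; i < n; i += size) {
>>>           x = h * ((double)i + 0.5);
>>>           sum += 4.0 / (1.0 + x * x);
>>>       }
>>>       mypi = h * sum;
>>>
>>>       /* collect the partial sums on rank 0 */
>>>       MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
>>>                  MPI_COMM_WORLD);
>>>       if (rank == 0)
>>>           printf("pi is approximately %.16f\n", pi);
>>>
>>>       MPI_Finalize();
>>>       return 0;
>>>   }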
>>>
>>>I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
>>>except for the
>>>
>>>[btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp]
>>>ibv_create_qp: returned 0 byte(s) for max inline data
>>>
>>>error messages, to be fixed I suppose in the next release. These I can
>>>live with, perhaps, for now.
>>>
>>>The problem is that when I run with nodes=64:ppn=2 and only use -np 64
>>>with my openmpi (v1.0.2, gcc-compiled), it still runs fine, but when I
>>>run with -np 65 I get megabytes of error messages and the job never
>>>completes. The errors all look like this:
>>>
>>>mca_mpool_openib_register: ibv_reg_mr(0x2a96641000,1060864)
>>>failed with error: Cannot allocate memory
>>>
>>>I've submitted to the openib-general mailing list with no responses. I'm
>>>not sure if this is an openmpi problem, an openib problem, or some
>>>configuration problem with the IB fabric. Other programs fail with
>>>these same errors at even lower processor counts. Running over TCP,
>>>albeit across the GigE network and not over IB, works fine.
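>>>
>>>(Forcing the tcp BTL in Open MPI, for anyone wanting to reproduce
>>>the working case, looks something like:
>>>
>>>   mpirun --mca btl tcp,self -np 65 ./cpi
>>>
>>>which keeps IB out of the picture entirely.)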
>>>
>>>I'm stuck here, not knowing how to proceed. Has anyone seen this issue
>>>and, more importantly, found a solution? I don't believe it to be a
>>>limits.conf issue as I can allocate both processors on a node up to 32
>>>nodes (-np 64) without problems.
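>>>
>>>One sanity check worth running: each failed ibv_reg_mr() above is
>>>trying to pin about 1 MB (1060864 bytes), and it fails with ENOMEM
>>>once the locked-memory limit of the calling process is exhausted.
>>>Since limits.conf is applied by PAM at session setup, ranks started
>>>on remote nodes may not inherit the limit you set. A minimal sketch
>>>to print what the ranks actually got:
>>>
>>>   #include <stdio.h>
>>>   #include <sys/resource.h>
>>>   #include <mpi.h>
>>>
>>>   int main(int argc, char *argv[])
>>>   {
>>>       int rank;
>>>       struct rlimit rl;
>>>
>>>       MPI_Init(&argc, &argv);
>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>       /* RLIMIT_MEMLOCK values are reported in bytes */
>>>       getrlimit(RLIMIT_MEMLOCK, &rl);
>>>       if (rl.rlim_cur == RLIM_INFINITY)
>>>           printf("rank %d: memlock unlimited\n", rank);
>>>       else
>>>           printf("rank %d: memlock %lu KB\n", rank,
>>>                  (unsigned long)(rl.rlim_cur / 1024));
>>>
>>>       MPI_Finalize();
>>>       return 0;
>>>   }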
>>>
>>>Thanks,
>>>Bill
>_______________________________________________
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf