[Beowulf] IB troubles - mca_mpool_openib_register
hunting at ix.netcom.com
Thu Jun 22 11:08:19 PDT 2006
If you are going to look into a different MPI implementation,
consider HP-MPI. Support for the common interconnects (GigE,
Myrinet, IB, and Quadrics) is built into it, so you can create a
single (common) operating environment for your programmers. I had a
look at the benchmarks a few months ago, and they appeared pretty
consistent across the board.
At 10:37 AM 6/22/2006, Bill Wichser wrote:
>No, I have not tried a different version of MPI yet, but I will do
>so. As for a later version of OpenIB, there is incentive to upgrade,
>but I don't know how quickly that can be accomplished.
>Lombard, David N wrote:
>>More memory in your nodes? I'm not sure what queue sizes and such
>>openmpi allocates, but you could simply be running out of memory if
>>openmpi allocates large queue depths.
>>Have you tried an alternate MPI to see if you have the same problem?
>>Intel MPI, MVAPICH, MVAPICH2, as well as others support OpenIB.
>>Can you consider moving to a newer version of OpenIB?
>>David N. Lombard
>>My statements represent my opinions, not those of Intel Corporation
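A rough way to see why queue allocation can outgrow memory as ranks are added: with a reliable-connection IB transport, each local rank typically needs a queue pair per remote peer, so per-node registered memory grows with the total rank count. A minimal illustrative sketch (the per-QP byte figure below is a made-up placeholder, not Open MPI's actual default):

```python
def qp_memory_per_node(total_ranks, ranks_per_node, bytes_per_qp):
    """Rough registered-memory estimate for one node, assuming one
    reliable-connection queue pair per remote peer per local rank."""
    remote_peers = total_ranks - 1  # each local rank connects to all others
    return ranks_per_node * remote_peers * bytes_per_qp

# Placeholder figure: 256 KiB of pinned buffers per QP (illustrative only).
BYTES_PER_QP = 256 * 1024
one_ppn = qp_memory_per_node(64, 1, BYTES_PER_QP)  # 64 ranks, 1 per node
two_ppn = qp_memory_per_node(65, 2, BYTES_PER_QP)  # 65 ranks, 2 per node
```

Going from one rank per node at -np 64 to two ranks per node at -np 65 roughly doubles the per-node demand, which is consistent with hitting a memlock ceiling only in the larger run.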
>>>From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
>>>Behalf Of Bill Wichser
>>>Sent: Thursday, June 22, 2006 6:02 AM
>>>To: beowulf at beowulf.org
>>>Subject: [Beowulf] IB troubles - mca_mpool_openib_register
>>>Cluster with dual Xeons and Topspin IB adapters running a RH
>>>2.6.9-34.ELsmp kernel (x86_64) with the RH IB stack installed, each
>>>node w/8G of memory.
>>>Updated firmware in the IB cards as per Mellanox.
>>>Updated /etc/security/limits.conf to set memlock to 8192 for both
>>>soft and hard limits, to overcome the initial trouble of pool
>>>allocation.
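For reference, the memlock bump described above would look something like the following in /etc/security/limits.conf (the value is in KB; whether 8192 KB is enough for large IB jobs is exactly what is in question here):

```
# /etc/security/limits.conf -- allow users to lock up to 8192 KB
*    soft    memlock    8192
*    hard    memlock    8192
```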
>>>Application is cpi.c.
>>>I can run across the 64 nodes using nodes=64:ppn=1 without trouble,
>>>except for the
>>>_qp: returned 0 byte(s) for max inline data
>>>error messages, which I suppose will be fixed in the next release.
>>>These I can live with for now.
>>>The problem is that when I run with nodes=64:ppn=2 and only use -np 64
>>>with my openmpi (v 1.0.2 gcc compiled), it still runs fine, but when I
>>>run with -np 65 I get megabytes of error messages and the job never
>>>completes. The errors all look like this:
>>>failed with error: Cannot allocate memory
>>>I've submitted to the openib-general mailing list with no responses.
>>>I'm not sure if this is an openmpi problem, an openib problem, or a
>>>configuration problem with the IB fabric. Other programs fail with
>>>these same errors when fewer processors are allocated. Running over
>>>TCP, albeit across the GigE network and not over IB, works fine.
>>>I'm stuck here, not knowing how to proceed. Has anyone hit this
>>>and, more importantly, found a solution? I don't believe it to be a
>>>limits.conf issue, as I can allocate both processors on a node up
>>>to 32 nodes (-np 64) without problems.
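One way to see why -np 65 is the tipping point: assuming Open MPI's default by-slot placement (an assumption about this setup, not something stated in the thread), a nodes=64:ppn=2 hostfile is filled two ranks per node before moving to the next node, so -np 64 lands on only 32 nodes with 2 ranks each, while -np 65 spills onto a 33rd node:

```python
def by_slot_placement(n_nodes, slots_per_node, np):
    """Fill slots node by node (by-slot scheduling), returning the
    number of ranks placed on each node."""
    counts = [0] * n_nodes
    for rank in range(np):
        counts[rank // slots_per_node] += 1
    return counts

nodes_64 = by_slot_placement(64, 2, 64)
nodes_65 = by_slot_placement(64, 2, 65)
used_64 = sum(1 for c in nodes_64 if c)  # nodes actually used at -np 64
used_65 = sum(1 for c in nodes_65 if c)  # nodes actually used at -np 65
```

Under that assumption, the -np 64 run that succeeds with ppn=2 is already running two ranks per node, so the failure at -np 65 is more consistent with total connection count (and hence registered memory) than with ppn alone.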
>>>Beowulf mailing list, Beowulf at beowulf.org
>>>To change your subscription (digest mode or unsubscribe) visit