[Beowulf] Good IB network performance when using 7 cores, poor performance on all 8?
bdobbins at gmail.com
Sat Apr 26 19:08:30 PDT 2014
Thanks for your note -- I've been intending to update the list, but was
hoping for a clear resolution first. It turns out the problem doesn't
*appear* to be specific to QLogic cards - we tested some nodes equipped
with Mellanox adapters in the same cluster and they showed the same issue.
Interestingly, they showed the problem when using 7 *or* 8 cores, whereas
on QLogic cards it only appears when using all 8 cores. Since it isn't
hardware, and isn't the network, it's got to be something with the OS, so
I'm speaking with our IT guys early next week about new options. Ideally,
we'll try RHEL6.5 and see if the problem persists there - a number of
people have suggested it's much better than RHEL6.2.
It's interesting, though, that the 'working' node image does in fact have
your 'ipath_checkout' executable and the other images (which all give high
variance in timings) don't. However, since all the different images do
have the libpsm_infinipath.so libraries, I think they are in fact using
PSM. And just going by the numbers (*some* runs do show perfectly fine
timings), I'm inclined to think it's not an issue with failing to use PSM.
I'll chalk the fact that some node images don't have the ipath_checkout
executable up to a difference in how things were installed on them.
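
Having the library on disk doesn't strictly prove that a given MPI process
loaded it. One way to check directly is to look at the memory map of a running
rank. The following is a minimal sketch, assuming a Linux /proc filesystem; the
function names are hypothetical helpers, not part of any PSM or OpenMPI tool,
and the only thing matched is the 'libpsm_infinipath' name mentioned above.

```python
def psm_libraries(maps_text):
    """Return sorted, de-duplicated paths of mapped files whose path
    contains 'libpsm_infinipath', given the text of /proc/<pid>/maps."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, [pathname].
        # Anonymous mappings have no pathname and yield only five fields.
        fields = line.split(None, 5)
        if len(fields) == 6 and "libpsm_infinipath" in fields[5]:
            paths.add(fields[5].strip())
    return sorted(paths)


def psm_libraries_for_pid(pid):
    """Read /proc/<pid>/maps for a running process (e.g. one MPI rank)."""
    with open("/proc/%d/maps" % pid) as f:
        return psm_libraries(f.read())
```

An empty list for a running rank would suggest the process never loaded the
library; a non-empty list only shows it is mapped, not that it is the
transport actually carrying the traffic.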
Anyway, we *did* try turning off the C-state power options and
power-savings on the PCIe slot, as well as changing some kernel parameters
(like enabling 'sched_compat_yield'), but so far none of those have
resolved the problem.
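
When comparing node images, it can help to confirm that a kernel parameter is
actually set the same way on each one. The following is a rough sketch,
assuming the parameter is exposed under /proc/sys (as 'sched_compat_yield' is
on kernels that provide it); 'read_sysctl' is a hypothetical helper and the
'root' argument exists only so it can be pointed at a copy of the tree.

```python
import os


def read_sysctl(name, root="/proc/sys"):
    """Return the current value of a dotted sysctl name as a string,
    or None if the kernel does not expose it (or it is unreadable)."""
    # 'kernel.sched_compat_yield' -> <root>/kernel/sched_compat_yield
    path = os.path.join(root, *name.split("."))
    try:
        with open(path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return None


if __name__ == "__main__":
    print(read_sysctl("kernel.sched_compat_yield"))
```

Running it on a 'working' and a 'slow' node and comparing the output would show
whether the setting really differs between images; a result of None just means
that kernel has no such entry.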
I'll definitely update the list once we pin down what the issue was, or
at least what change solved it.
On Sat, Apr 26, 2014 at 9:24 PM, Greg Lindahl <lindahl at pbm.com> wrote:
> On Thu, Apr 24, 2014 at 11:31:56AM -0400, Brian Dobbins wrote:
> > Hi everyone,
> > We're having a problem with one of our clusters after it was upgraded
> > to RH6.2 (from CentOS5.5) - the performance of our Infiniband network
> > degrades randomly and severely when using all 8 cores in our nodes for MPI,... but
> > not when using only 7 cores per node.
> Sounds to me like you aren't using the special libpsm library needed
> for good MPI perf with your IB cards. It's supported in OpenMPI, and
> ought to be invoked by default if present... maybe it isn't installed?
> If you've got everything installed the right way, there should be a
> program called ipath_checkout that examines your hardware and software
> and tells you if everything is OK. We never thought our customers
> should have to try to debug things by writing code!
> -- greg