[Beowulf] MPI performance on clusters of SMP
Mario Donato Marino
mario at regulus.pcs.usp.br
Fri Aug 27 10:11:59 PDT 2004
On Fri, 27 Aug 2004, Franz Marini wrote:
> On Thu, 2004-08-26 at 20:22, Kozin, I (Igor) wrote:
> To give you an example of the complexity of the problem, on our
> dual-Xeon Infiniband cluster, using both Gromacs and CPMD, we are able
> to achieve much better scalability using 2*N x 1 cpus than N x 2. It
> looks like dual-Xeons have a big performance hit when the two cpus run
> the same process (thus with the same memory access pattern). Please note
> that if you run a copy of Gromacs on N x 1 and at the same time you run
> a copy of CPMD on N x 1 (thus using all 2*N cpus, but with different
> processes between each of the two cpus of each node, and thus different
> mem access patterns), you achieve almost perfect scalability with both
> programs. Thus, it looks like Xeons are heavily (negatively) impacted by
> the mem access pattern of the processes running on the cpus of an smp
> box. I've been told that with Opterons this kind of problem is much
> less pronounced.
But what is the main reason for the better scalability when the two
cpus of a dual-Xeon run processes with different memory access patterns?
> Oh, btw, the same holds true if you launch two Gromacs instances each on
> N x 1 at different times (or with different inputs), so it really looks
> like it is a problem directly related to the memory access patterns of
> the processes running on the cpus.
> In the end, I'd say that if you plan to run only a single copy of a
> single program across the whole cluster, you'd be better off with a 2*N x
> 1 solution. On the other hand, if you plan to run different programs at
> the same time, an N x 2 solution is much better (you have much lower
> costs (provided we're talking about high speed interconnects, because
> fast ethernet and even gigabit are quite cheap right now) with almost
> the same (global) performance).
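
To make the two layouts being compared concrete, here is a small Python
sketch that lists which host each MPI rank would land on for N x 2 and
for 2*N x 1. The host names (node00, node01, ...), the helper layout()
and N = 4 are made up for illustration; they are not taken from the
cluster described above, and the sketch only shows placement, not
performance.

```python
def layout(num_nodes, procs_per_node, prefix="node"):
    """Return one host name per MPI rank, filling each node before the next.

    Host names are hypothetical (prefix + two-digit index).
    """
    return [
        f"{prefix}{n:02d}"
        for n in range(num_nodes)
        for _ in range(procs_per_node)
    ]


N = 4  # assumed node count, for illustration only

# N x 2: both cpus of every node run ranks of the same program.
n_by_2 = layout(N, 2)

# 2*N x 1: same number of ranks, but only one cpu used per node.
two_n_by_1 = layout(2 * N, 1)

for name, hosts in (("N x 2", n_by_2), ("2*N x 1", two_n_by_1)):
    print(f"{name}: {len(hosts)} ranks on {len(set(hosts))} nodes")
    print("  " + " ".join(hosts))
```

Both layouts give the same number of ranks; what changes is whether two
ranks of the same program share one node's memory system.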