[Beowulf] MPI performance on clusters of SMP

Fri Aug 27 00:47:57 PDT 2004

On Thu, 2004-08-26 at 20:22, Kozin, I (Igor) wrote:
> Thus using both cpus on a node creates even higher load on the 
> connection. Hypothetically, when the memory bandwidth and 
> the switch are not a problem then using N x 2 configuration 
> with 2 network cards per node should be always superior to 
> using 2*N x 1 config with 1 network card per node.
> (same number of cards and cpus!).

  Err. Not always ;)

  As Philippe pointed out, it also depends on the architecture of the
dual-cpu boxes.
  With Opterons, Nx2 w/ 2 nics should indeed be faster than 2Nx1 w/ 1 nic
(or at least, it shouldn't be slower).
  With Xeons, I'd say that most of the time the opposite is true.
  This is due to the different memory architectures of the two cpus:
each Opteron has its own integrated memory controller, while the Xeons
in a box share a single, off-chip memory controller.

  Keep in mind that (unless your problem fits entirely in L1/L2 cache)
using a 2 cpu box doesn't guarantee you 2x the speed (in fact, it never
reaches 2x the speed). Usually, you'll see a 60-90% improvement over a
single processor. With Opterons (I'm assuming we're only talking about
x86 architectures, and thus I'm not considering mips, pa-risc, and so
on) these figures are usually higher than with Xeons, mainly because of
the memory architecture used by the two.
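
  To make the arithmetic concrete, here is a minimal Python sketch that
turns two wall-clock times into the speedup and per-cpu efficiency
figures I'm talking about. It is only an illustration: the function
names are made up and the timings are hypothetical, not measurements
from our cluster.

```python
def speedup(t_single, t_dual):
    """Return how many times faster the dual-cpu run is than the single-cpu run."""
    if t_single <= 0 or t_dual <= 0:
        raise ValueError("timings must be positive")
    return t_single / t_dual


def efficiency(t_single, t_dual, ncpus=2):
    """Return the fraction of the ideal (ncpus-fold) speedup actually achieved."""
    return speedup(t_single, t_dual) / ncpus


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds, chosen only to show the formula.
    t1, t2 = 100.0, 60.0
    s = speedup(t1, t2)
    print(f"speedup: {s:.2f}x, improvement: {(s - 1) * 100:.0f}%, "
          f"efficiency: {efficiency(t1, t2) * 100:.0f}%")
```

  In these terms, a 60-90% improvement is a speedup of 1.6x to 1.9x,
that is, 80-95% efficiency on two cpus.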

  To give you an example of the complexity of the problem: on our
dual-Xeon Infiniband cluster, with both Gromacs and CPMD, we get much
better scalability using 2*N x 1 cpus than N x 2. It looks like
dual-Xeons take a big performance hit when the two cpus run the same
program (and thus have the same memory access pattern). Please note
that if you run a copy of Gromacs on N x 1 and at the same time a copy
of CPMD on N x 1 (thus using all 2*N cpus, but with a different program
on each of the two cpus of a node, and thus different mem access
patterns), you get almost perfect scalability with both programs. Thus,
it looks like Xeons are heavily (negatively) impacted by the mem access
patterns of the processes running on the cpus of an smp box. I've been
told that with Opterons this kind of problem is much less pronounced.

 Oh, btw, the same holds true if you launch two Gromacs instances, each
on N x 1, at different times (or with different inputs), so it really
looks like the problem is directly related to the memory access patterns
of the processes running on the cpus.
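
 If you want to repeat this kind of comparison on your own cluster, the
only thing that changes between the two runs is where the ranks land.
Here is a rough Python sketch that builds the two placements for the
same number of ranks. The helper names and node names are hypothetical,
and how you hand the resulting list to your MPI launcher depends on the
launcher, so check its documentation.

```python
def rank_hosts(hosts, ranks_per_node):
    """Return one host name per MPI rank, filling each node before moving on."""
    if ranks_per_node < 1:
        raise ValueError("ranks_per_node must be at least 1")
    return [h for h in hosts for _ in range(ranks_per_node)]


def layouts(hosts, nranks):
    """Build both placements for the same number of ranks.

    '2N x 1' uses nranks nodes with one rank each;
    'N x 2' uses nranks // 2 nodes with two ranks each.
    """
    if nranks % 2 != 0 or nranks > len(hosts):
        raise ValueError("need an even nranks and at least nranks hosts")
    return {
        "2N x 1": rank_hosts(hosts[:nranks], 1),
        "N x 2": rank_hosts(hosts[:nranks // 2], 2),
    }


if __name__ == "__main__":
    # Hypothetical node names; substitute your own.
    nodes = [f"node{i:02d}" for i in range(1, 9)]
    for name, ranks in layouts(nodes, 8).items():
        print(name)
        print("\n".join(ranks))
```

 Timing the same input under both layouts gives you the two numbers to
compare.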

 In the end, I'd say that if you plan to run only a single copy of a
single program across the whole cluster, you'd be better off with a
2*N x 1 solution. On the other hand, if you plan to run different
programs at the same time, an N x 2 solution is much better: you get
almost the same (global) performance at a much lower cost, provided
we're talking about high speed interconnects (fast ethernet and even
gigabit are quite cheap right now).

 Have a good day,

 Franz

---------------------------------------------------------
Franz Marini
Sys Admin and Software Analyst,
Dept. of Physics, University of Milan, Italy.
email : franz.marini at mi.infn.it
phone : +39 02 50317221
---------------------------------------------------------