[Beowulf] Re: Problems scaling performance to more than one node, GbE
cap at nsc.liu.se
Mon Feb 23 10:14:55 PST 2009
On Tuesday 17 February 2009, Bogdan Costescu wrote:
> On Mon, 16 Feb 2009, Tiago Marques wrote:
> > I must ask, doesn't anybody on this list run like 16 cores on two
> > nodes well, for a code and job that completes like in a week?
> For GROMACS and other MD programs, the way a job runs depends on a lot
> of factors that define the simulation: the size of the molecular
> system, the force field in use, the cutoff distances, etc.
> I have found several MD codes to scale rather poorly when used on
> clusters composed of 8-core nodes, especially when those 8 cores are
> coming from 2 quad-core Intel CPUs;
A data-point from some testing I've done with gromacs-4.0.2 on dual-quad
Clovertown nodes using the 4 classical test cases from gmxbench:
villin and poly-ch: 7.1x speed up on one node (compared to using one core)
dppc and lzm: superlinear at 8.9x and 8.2x.
Gromacs-4.0.2 seems to be able to almost fully use the four extra cores even
on a memory bandwidth choked node.
> the poor scaling was also with
> InfiniBand (Mellanox ConnectX), so IB will not magically solve your
I've also done some scaling tests on our IB (ConnectX). Gromacs scales (for
the quite small lzm case) as follows:
1 node 8 ranks: 8.2x
2 nodes 16 ranks: 15x
4 nodes 32 ranks: 25x
8 nodes 64 ranks: 38x
Using Ethernet (disclaimer: our Ethernet isn't really tuned, since we
mostly run MPI on IB), I got a 10x speedup on two nodes using 16 ranks (I
didn't try using more nodes).
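
For anyone who prefers to read these numbers as parallel efficiency
(speedup divided by rank count), here is a minimal Python sketch. The
speedups are the lzm figures quoted in this post. The variable and
function names are made up for illustration and are not part of gromacs
or gmxbench.

```python
# Speedups relative to one core for the lzm case, copied from the post.
# Keys are (label, nodes, ranks); the labels are illustrative only.
speedups = {
    ("IB", 1, 8): 8.2,
    ("IB", 2, 16): 15.0,
    ("IB", 4, 32): 25.0,
    ("IB", 8, 64): 38.0,
    ("Ethernet", 2, 16): 10.0,
}


def parallel_efficiency(speedup, ranks):
    """Speedup divided by rank count; 1.0 means ideal linear scaling."""
    return speedup / ranks


# Efficiency for every measured configuration.
efficiencies = {
    key: parallel_efficiency(value, key[2]) for key, value in speedups.items()
}

for (label, nodes, ranks), eff in efficiencies.items():
    print(f"{label:>8} {nodes} node(s) {ranks:2d} ranks: {eff:.1%}")
```

Values above 100% correspond to the superlinear single-node result. The
two-node entries let the IB and Ethernet runs be compared at the same
rank count.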
Hopefully someone found this late post worth reading,
> The setup that seemed to me like a good compromise was with
> 4-core nodes, when these 4 cores come from 2 dual-core CPUs,
> associated with Myrinet or IB.
> You have to understand that, the way most MD programs are done these
> days, the MD simulations of small molecular systems are simply not
> going to scale; the communication dominates the total runtime.
> Communication through shared memory is still the best way to scale
> such a job, so having a node with as many cores as possible and running
> a job to use all of them is probably going to give you the best