[Beowulf] Re: Problems scaling performance to more than one node, GbE

Mon Feb 23 10:14:55 PST 2009

On Tuesday 17 February 2009, Bogdan Costescu wrote:
> On Mon, 16 Feb 2009, Tiago Marques wrote:
> > I must ask, doesn't anybody on this list run like 16 cores on two
> > nodes well, for a code and job that completes like in a week?
>
> For GROMACS and other MD programs, the way a job runs depends on a lot
> of factors that define the simulation: the size of the molecular
> system, the force field in use, the cutoff distances, etc.
...
> I have found several MD codes to scale rather poorly when used on
> clusters composed of 8-core nodes, especially when those 8 cores are
> coming from 2 quad-core Intel CPUs;

A data-point from some testing I've done with gromacs-4.0.2 on dual-quad 
Clovertown nodes using the 4 classical test cases from gmxbench:

villin and poly-ch: 7.1x speed up on one node (compared to using one core)
dppc and lzm: superlinear at 8.9x and 8.2x.

Gromacs-4.0.2 seems to be able to almost fully use the four extra cores even 
on a memory bandwidth choked node.

> the poor scaling was also with 
> InfiniBand (Mellanox ConnectX), so IB will not magically solve your
> problems.

I've also done some scaling tests on our IB (ConnectX). Gromacs scales (for 
the quite small lzm case) as follows:
 1 node   8 ranks: 8.2x
 2 nodes 16 ranks: 15x
 4 nodes 32 ranks: 25x
 8 nodes 64 ranks: 38x

Using ethernet (disclaimer: our ethernet isn't really super-tuned since we 
mostly run MPI on IB) I got 10x speed up on two nodes using 16 ranks (I 
didn't try using more nodes).
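
To make the numbers above easier to compare, here is a small sketch that turns them into parallel efficiency (speedup divided by number of ranks). The speedups are the ones quoted in this post for the lzm case; the function and variable names are made up for illustration and are not part of gromacs or gmxbench.

```python
# Speedups relative to one core, as quoted in this post (lzm case).
# Keys are MPI rank counts, values are measured speedups.
IB_SPEEDUP = {8: 8.2, 16: 15.0, 32: 25.0, 64: 38.0}
GBE_SPEEDUP = {16: 10.0}


def parallel_efficiency(speedup, ranks):
    """Return speedup per rank; 1.0 means perfect linear scaling."""
    if ranks <= 0:
        raise ValueError("ranks must be positive")
    return speedup / ranks


def efficiency_table(speedups):
    """Map each rank count to its parallel efficiency."""
    return {ranks: parallel_efficiency(s, ranks) for ranks, s in speedups.items()}


if __name__ == "__main__":
    for label, data in (("IB", IB_SPEEDUP), ("GbE", GBE_SPEEDUP)):
        for ranks, eff in sorted(efficiency_table(data).items()):
            print(f"{label:3s} {ranks:3d} ranks: efficiency {eff:.3f}")
```

On IB this works out to roughly 94% at 16 ranks, 78% at 32 and 59% at 64, against 62.5% for 16 ranks on ethernet.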

Hopefully someone found this late post worth reading,
 Peter

> The setup that seemed to me like a good compromise was with 
> 4-core nodes, when these 4 cores come from 2 dual-core CPUs,
> associated with Myrinet or IB.
>
> You have to understand that, the way most MD programs are written
> these days, MD simulations of small molecular systems are simply not
> going to scale, because communication dominates the total runtime.
> Communication through shared memory is still the best way to scale
> such a job, so having a node with as many cores as possible and running
> a job to use all of them is probably going to give you the best
> performance.