[Beowulf] Re: Problems scaling performance to more than one node, GbE

Sat Feb 14 10:43:56 PST 2009

Tiago Marques <a28427 at ua.pt>

> I've been trying to get the best performance on a small cluster we have here
> at University of Aveiro, Portugal, but I've not been able to get most
> software to scale to more than one node.

<SNIP>

> The problem with this setup is that even calculations that take more than 15
> days don't scale to more than 8 cores, or one node. Usually performance is
> lower with 16 cores, 12 cores, than with just 8. From what I've been reading,
> I should be able to scale fine at least till 16 cores and 32 for some
> software.

<SNIP>
> 
> I tried with Gromacs to have two nodes using one processor each, to check if
> 8 cores were stressing the GbE too much, and the performance dropped too
> much compared with running two CPUs on the same node.

Lots of possibilities here. Most of them probably come down to the code
not being written to make good use of a cluster environment, or to the
problem offering no way to do that (for example, single threaded code
with a lot of unpredictable branching).

For Gromacs I suggest you ask on that mailing list.  My recollection is
that it was known to scale poorly, but that was a couple of years ago,
and maybe they have improved it since then.  If it doesn't scale you can
always get more throughput by running one independent job on each of
your nodes, using local storage to avoid network contention to the file
server.  Each run may still take 15 days, but in that time you'll have
completed N runs instead of one.  Running N independent jobs will give
you at least as much throughput as running 1 job spread across the same
N nodes.  Admittedly it is nice to have the results in 1/Nth the time.
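
To put rough numbers on that trade-off, here is a minimal Python sketch.
The function names and the "efficiency" parameter (the fraction of ideal
speedup a parallel run actually achieves) are my own illustrative
assumptions, not anything taken from Gromacs, and the figures in the
example are made up:

```python
def throughput_independent(n_nodes, days_per_job):
    """Jobs finished per day when every node runs its own separate job."""
    return n_nodes / days_per_job


def throughput_parallel(n_nodes, days_per_job, efficiency):
    """Jobs finished per day when ONE job is spread over n_nodes.

    efficiency is the assumed parallel efficiency (1.0 = perfect scaling),
    so the achieved speedup is n_nodes * efficiency.
    """
    speedup = n_nodes * efficiency
    return speedup / days_per_job


if __name__ == "__main__":
    days = 15.0  # hypothetical single-node run time
    for eff in (1.0, 0.7, 0.4):
        indep = throughput_independent(2, days)
        par = throughput_parallel(2, days, eff)
        print(f"efficiency {eff:.1f}: independent {indep:.3f} jobs/day, "
              f"parallel {par:.3f} jobs/day")
```

With perfect scaling the two approaches tie on throughput.  Once the
efficiency on two nodes drops below 0.5, the parallel run is slower than
a single node on its own, which is the situation described above.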

Some of the poorer performance you may be seeing with more cores on one
node is probably related to memory access, especially through cache.
Code whose working set fits in cache runs much faster than anything
which has to go to main memory, and as soon as you run two processes
that compete for a shared cache (whether they do depends on the
architecture) you may find that the two programs are throwing each
other's data out of it, which can result in dramatic slowdowns.

Give gprof a shot too.  You want to see where your code is spending most
of its time.  If it spends 95% of its time in routines with no network
IO, then the network is likely not your issue.  And vice versa.
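
The reasoning behind that is essentially Amdahl's law.  As a minimal
sketch, assuming you can estimate what fraction of the run time is spent
in network IO (the function name and the 5% figure are illustrative, and
a profile only gives that fraction approximately), the best speedup a
faster network could ever give is:

```python
def max_speedup_from_faster_network(network_fraction):
    """Upper bound on speedup if network time were reduced to zero.

    network_fraction is the assumed share of total run time spent in
    network IO, between 0.0 (none) and just under 1.0.
    """
    if not 0.0 <= network_fraction < 1.0:
        raise ValueError("network_fraction must be in [0.0, 1.0)")
    # Only the network part can shrink; the rest of the run time stays.
    return 1.0 / (1.0 - network_fraction)


if __name__ == "__main__":
    # 95% of the time in routines with no network IO leaves 5% for the network.
    bound = max_speedup_from_faster_network(0.05)
    print(f"best possible speedup from a faster network: {bound:.3f}x")
```

So if only 5% of the time goes to the network, even an infinitely fast
interconnect would buy you a few percent at most.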

> unexpected for me, since the benchmarks I've seen on Gromacs website state
> that I should be able to have 100% scaling on this case, sometimes more.

Contact the person who said that, get the exact conditions, and see if
you can replicate them.  You might have a network issue, but unless you
are comparing apples to apples it may be hard to figure it out.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech