> > I've been trying to get the best performance on a small cluster we have
> > here at University of Aveiro, Portugal, but I've not been able to get
> > most software to scale to more than one node.
> <SNIP>
> > The problem with this setup is that even calculations that take more
> > than 15 days don't scale to more than 8 cores, or one node. Usually
> > performance is lower with 16 cores or 12 cores than with just 8. From
> > what I've been reading, I should be able to scale fine at least till 16
> > cores, and 32 for some software.
> <SNIP>
> >
> > I tried with Gromacs to have two nodes using one processor each, to
> > check if 8 cores were stressing the GbE too much, and the performance
> > dropped too much compared with running two CPUs on the same node.
> Lots of possibilities here. Most of them are probably coming down to the
> code not being written to make good use of a cluster environment, and/or
> there not being any way to do that (single threaded code with a lot of
> unpredictable branching).
> For Gromacs I suggest you ask on that mailing list.  My recollection is
> that it was known to scale poorly, but that was a couple of years ago,
> and maybe they have improved it since then.  If it doesn't scale you can
> always get more throughput by running one independent job on each of
> your nodes, using local storage to avoid network contention to the file
> server.  It may take 15 days to finish a run, but at least you'll have N
> times more work completed.  Running N independent jobs will give you at
> least as much throughput as running 1 job on N cores.  Admittedly it is
> nice to have the results in 1/Nth the time.
I already did that, but there are not many helpful people on the Gromacs
list. They just told me to wait for version 4.0, which I did. It scales
better, though still not as well as I hoped.
We have already been running a single job per node for months, but it would
be good to be able to run jobs faster; sometimes it's needed.
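
To put numbers on the trade-off between independent jobs and one parallel
job, here is a minimal sketch. The function names and the "efficiency"
parameter (achieved speedup divided by node count) are my own illustration,
not anything taken from Gromacs, and the 0.6 efficiency is an assumed value,
not a measurement.

```python
def throughput_independent(n_nodes, days_per_job):
    """Jobs finished per day when each node runs its own job."""
    return n_nodes / days_per_job


def throughput_parallel(n_nodes, days_per_job, efficiency):
    """Jobs finished per day when one job is spread over all nodes.

    efficiency is the assumed parallel efficiency in (0, 1]:
    speedup = efficiency * n_nodes.
    """
    speedup = efficiency * n_nodes
    return speedup / days_per_job


def days_to_first_result(n_nodes, days_per_job, efficiency):
    """Wall-clock days until the single parallel job completes."""
    return days_per_job / (efficiency * n_nodes)


if __name__ == "__main__":
    nodes, days = 2, 15.0   # two nodes, a 15-day single-node run
    eff = 0.6               # assumed efficiency, for illustration only
    print("independent jobs/day:", throughput_independent(nodes, days))
    print("parallel jobs/day:   ", throughput_parallel(nodes, days, eff))
    print("days to first result:", days_to_first_result(nodes, days, eff))
```

With any efficiency below 1 the parallel run finishes fewer jobs per day,
but the first result arrives sooner than the 15 days of a single-node run,
which is the case I care about when a result is needed quickly.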

> Some of what you may be seeing with poorer performance on more cores on
> one node is probably related to the effect on memory access, especially
> through cache.  Code that can go in and out of cache runs much faster
> than anything which has to go to main memory, and as soon as you run two
> competing (which depends on architecture) processes you may find that
> the two programs are throwing each other's data out of any shared cache,
> which can result in dramatic slowdowns.
> Give gprof a shot too.  You want to see where your code is spending most
> of its time.  If it spends 95% of its time in routines with no network
> IO, then the network is likely not your issue.  And vice versa.
I have thought of that, but I didn't manage to do it on the more important
codes. It compiles but just doesn't produce the profiling output.
I have used "iftop" to measure network usage and it's probably around
300-400Mbit/s, so I was pointing at latency as the problem; throughput seems
fine. While copying files with "scp", I can get 93MB/s.
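
My reasoning for suspecting latency can be sketched with the usual
"latency plus size over bandwidth" cost model for one message. The 93MB/s
figure is the scp number above; the 50 microsecond per-message latency is
only an assumed placeholder for GbE, not something I measured, and the
function names are made up for this example.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to send one message: fixed latency plus wire time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def latency_fraction(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Share of the transfer time that is pure latency (0 to 1)."""
    total = transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


if __name__ == "__main__":
    bandwidth = 93e6   # bytes/s, from the scp measurement
    latency = 50e-6    # seconds, ASSUMED value for illustration
    for size in (1024, 64 * 1024, 1_000_000):
        frac = latency_fraction(size, latency, bandwidth)
        print(f"{size:>9} bytes: {frac:.1%} of the time is latency")
```

Under these assumptions small messages are dominated by latency while large
ones are dominated by bandwidth, so a code exchanging many small messages
can be slow even when iftop shows the link far from saturated.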

> > unexpected for me, since the benchmarks I've seen on Gromacs website
> > state that I should be able to have 100% scaling on this case, sometimes
> > more.
> Contact the person who said that, get the exact conditions, and see if
> you can replicate them.  You might have a network issue, but unless you
> are comparing apples to apples it may be hard to figure it out.

True. Thanks for the help.
I must ask: doesn't anybody on this list run something like 16 cores on two
nodes well, for a code and job that completes in about a week?
Or does most code that finishes in one or two weeks only scale with
InfiniBand and the like, in something like 99% of the cases?
Best regards,
                         Tiago Marques

