[Beowulf] recommendations for cluster upgrades

Sat May 16 16:54:58 PDT 2009

On Sat, May 16, 2009 at 11:56 PM, Rahul Nabar <rpnabar at gmail.com> wrote:

> On Sat, May 16, 2009 at 2:34 PM, Tiago Marques <a28427 at ua.pt> wrote:
> > One of the codes, VASP, is very bandwidth limited and loves to run on a
> > number of cores that is a multiple of 3. The 5400s are also very
> > bandwidth limited (memory and FSB), which means they sometimes don't
> > scale well above 6 cores. They are very fast per core, as someone
> > mentioned, when compared to AMD cores.
>
> Thanks Tiago. This is super useful info. VASP is one of our major
> "users" too. Possibly 40% of the cpu-time. Rest is a similar
> computational chemistry code, DACAPO.
>
> It would be interesting to compare my test-run times on our
> AMD-Opterons (Barcelona). Is it possible to share what your benchmark
> job was?
>

I'll check first with the user who crafted it for me, but it should be no
problem to pass it on to you afterwards.

>
> Since you mention VASP is bandwidth limited do you mean memory
> bandwidth or the interconnect? Maybe this question itself is naive.
> Not sure. What interconnect do you use? We have gigabit ethernet dual
> bonded.
>

Memory bandwidth, as you can see from the performance gain when going from
1066 to 1600MHz, even with looser timings IIRC.
Of course interconnects also play a role, even internal ones, which in the
case of the Xeons was a very slow FSB.

I use single GbE because, as far as I could benchmark, I hardly found
anything that could use more than one node efficiently, and no one - not
even here - could help me with that. It seems I need InfiniBand.
I only managed a 33% speedup with two nodes when running a really huge
job (100k+ atoms) on Gromacs.
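
To put that 33% figure in perspective, here is a minimal sketch of the usual
speedup and parallel-efficiency arithmetic. The function name and the
normalized timings are hypothetical; only the "33% faster on two nodes"
figure comes from the numbers above, read here as a speedup of 1.33.

```python
def speedup_and_efficiency(t_single, t_multi, units):
    """Return (speedup, efficiency) for a job spread over `units` nodes.

    t_single: wall time on one node
    t_multi:  wall time on `units` nodes
    """
    speedup = t_single / t_multi
    efficiency = speedup / units
    return speedup, efficiency


# Assumption: "33% faster" means the two-node run is 1.33x the one-node
# throughput, so the timings are normalized to 1.33 and 1.0.
speedup, efficiency = speedup_and_efficiency(t_single=1.33, t_multi=1.0, units=2)
print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.1%}")
```

Under that reading, each of the two nodes is doing roughly two thirds of the
useful work it would do on its own.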

>
> A side note: I wasn't aware of VASP preferring multiples of 3 cpus.
> I'm not a big VASP user myself but I see my users often submit jobs on
> multiples of 8 since we have 8 cpus / box.  Is that a big drag on
> performance? Why? I thought VASP parallelized over bands so could
> scale well to any cpu-multiple?
>

As far as I can tell it's because of the algorithm it's based on. I had
thought it might be related to the crap interconnects on the Xeons until I
benchmarked it on Nehalem.

If you can manage to find a way to scale without being a multiple of 3,
please tell me! I mostly just do administration of the cluster and
optimization of the codes by using compilers and libraries, but the users
couldn't really find me a job - whether it took hours or minutes - that
didn't have this exact behavior. It was the main user who suggested that the
algorithm might be to blame, as he is far more familiar with what goes on
behind the scenes than I am. To me, it just seems so.
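
As a small illustration of what the multiple-of-3 preference means on
8-core boxes like yours, the sketch below lists the core counts that fit in
one node and how many cores each leaves idle. The helper name is
hypothetical, and it assumes single-node jobs only; it says nothing about
why VASP behaves this way.

```python
def core_counts_for_node(cores_per_node, multiple=3):
    """List (cores_used, cores_idle) pairs for counts that are a
    multiple of `multiple` and fit within a single node."""
    counts = range(multiple, cores_per_node + 1, multiple)
    return [(n, cores_per_node - n) for n in counts]


# 8 cores per box, as in the quoted question.
for used, idle in core_counts_for_node(8):
    print(f"{used} cores used, {idle} idle")
```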

VASP isn't the only one though, but for different reasons.
Most codes don't scale significantly from 5/6 to 8 cores, but they usually
still gain something. This is the case with Gromacs, Gaussian and DL-Poly.
I don't recall which one exactly right now, since no one has been using them
for months, but it was either CPMD or Quantum Espresso that only scaled to
7 cores; add the eighth and it was slower. This was due to the lousy FSB
architecture and slow memory, probably more due to the FSB than the memory.
I even managed to get more performance by compiling VASP with the bandwidth
optimizations available in Ifort, more exactly the -opt-mem-bandwidth3
option. Not much though, a few percentage points.

Which brings me to a point that I forgot to mention to you. When considering
Intel machines, you can always get a compiler license for $2000, give or
take, and that will add about 15% more performance to what you are already
counting on. That was what I got on average for most codes, except Gromacs,
which already comes with lots of assembly inside; it still gained some 6%
though.
I don't think that GCC compilers are that well optimized for Opterons, which
may be a reason why some AMD clusters are using Pathscale compilers. Given
the price of the hardware, a compiler license will surely be worth it, if
you can handle the headaches associated with some compatibility problems
that arise from time to time. Intel support is really helpful though. It took
me some time, but I also managed to get Gaussian03 compiled with Ifort on
x86-64, though I haven't had the time to measure the gains compared to PGF
yet. Intel helped me a lot with this tough nut.
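
A rough way to check the "worth it" claim is a break-even calculation,
sketched below. It assumes, simplistically, that throughput scales linearly
with hardware spend, so a 15% gain is worth 15% of the hardware budget; the
function name is hypothetical and the $2000 and 15% figures are the
approximate ones quoted above.

```python
def break_even_hardware_cost(license_cost, performance_gain):
    """Hardware budget above which the license is cheaper than buying
    `performance_gain` (as a fraction) more hardware.

    Assumes performance scales linearly with hardware spend.
    """
    return license_cost / performance_gain


# Approximate figures from the paragraph above: $2000 license, 15% gain.
threshold = break_even_hardware_cost(2000.0, 0.15)
print(f"License pays off above roughly ${threshold:,.0f} of hardware")
```

With the 6% gain seen on Gromacs instead of 15%, the same calculation gives
a correspondingly higher threshold, so the answer depends on the code mix.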

Best regards,

Tiago Marques

>
>
> --
> Rahul
>