[Beowulf] bizarre scaling behavior on a Nehalem

Tiago Marques a28427 at ua.pt
Tue Nov 10 10:31:41 PST 2009

Hi all,

Sorry to resurrect this thread after all this time, but I just figured
out the problem with VASP by chance.

VASP's INCAR file accepts one parameter that both fixes scalability
problems and increases performance at the same time, even if you still
stick to 6 cores. That parameter is NPAR.
I was advised to set NPAR=2 for most calculations and it worked
great. Still, I experimented a bit, and NPAR=1 gave even better
results. It seems VASP, by default, is using NPAR=NCPUS, which
cripples performance if you don't use multiples of 3.

" running on    8 nodes
 distr:  one band on    8 nodes,    1 groups"

This is the output with NPAR=1.

NPAR=2 gives something like:

" running on    8 nodes
 distr:  one band on    4 nodes,    2 groups"

Enjoy the performance increase, if you haven't already. For us,
performance increased by around 33% in conjunction with running on 8
CPUs. It seems to me that groups may be useful when running on more
nodes and not just one machine, but I haven't had the chance to test
that out.
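
For anyone who wants to check the arithmetic behind the "distr:" lines
quoted above, here is a minimal Python sketch. It assumes the cores are
simply split evenly into NPAR groups, which is what the two outputs
suggest; it is not taken from the VASP source. The function names are
made up for illustration. The setting itself is one line in the INCAR
file, e.g. "NPAR = 1".

```python
def band_distribution(total_cores: int, npar: int) -> tuple[int, int]:
    """Return (cores_per_band, groups) for a core count and NPAR value.

    Assumption: cores are divided evenly into NPAR groups, as the quoted
    "one band on N nodes, M groups" lines suggest. This sketch only
    handles values of NPAR that divide the core count exactly.
    """
    if npar < 1 or total_cores % npar != 0:
        raise ValueError("this sketch needs NPAR to divide the core count")
    return total_cores // npar, npar


def describe(total_cores: int, npar: int) -> str:
    """Format the split in roughly the same wording as the quoted output."""
    cores_per_band, groups = band_distribution(total_cores, npar)
    return (f"running on {total_cores} nodes; "
            f"one band on {cores_per_band} nodes, {groups} groups")


if __name__ == "__main__":
    # Compare the settings discussed in this thread on an 8-core machine.
    for npar in (1, 2, 8):
        print(f"NPAR={npar}: {describe(8, npar)}")
```

With 8 cores this reproduces the two cases above: NPAR=1 puts each band
on all 8 cores in 1 group, and NPAR=2 puts each band on 4 cores in 2
groups.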

On Tue, Aug 11, 2009 at 6:57 PM, Rahul Nabar <rpnabar at gmail.com> wrote:
> On Tue, Aug 11, 2009 at 12:40 PM, Craig Tierney<Craig.Tierney at noaa.gov> wrote:
>> What are you doing to ensure that you have both memory and processor
>> affinity enabled?
> All I was using now was the flag:
> --mca mpi_paffinity_alone 1

I was actually using that on the Xeon 54xx machines. Since those
processors aren't native quad-cores, the kernel would keep bouncing
threads from core to core to achieve a proper load balance. This was
the best it could do, and I managed to get about 3% better performance
from using that flag together with disabling some kernel option I
don't quite remember right now, so the threads wouldn't jump around
anymore. If you didn't disable the load balancing, the code would
inevitably be mis-scheduled and end up running on only 5 cores (either
partway through or from the start), and calculations would take around
10x longer.
This was only useful with 6 cores per node, as then each processor
would be running precisely 3 threads. With eight I haven't tried it,
but I assume the advantage would be negligible.
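
To illustrate the "3 threads per processor" point, here is a small
Python sketch of a fixed rank-to-core mapping on a dual-socket,
quad-core node. It is only an illustration of the idea of pinning, not
a description of what Open MPI does internally with
mpi_paffinity_alone. The core numbering (socket 0 = cores 0-3, socket 1
= cores 4-7) and the function names are assumptions; real numbering
varies between machines.

```python
from collections import Counter


def pin_ranks(num_ranks: int, sockets: int = 2,
              cores_per_socket: int = 4) -> dict[int, int]:
    """Map each rank to one fixed core, alternating between sockets.

    Assumed (hypothetical) numbering: socket s owns cores
    s * cores_per_socket .. (s + 1) * cores_per_socket - 1.
    """
    if num_ranks > sockets * cores_per_socket:
        raise ValueError("more ranks than cores")
    mapping = {}
    for rank in range(num_ranks):
        socket = rank % sockets    # alternate sockets rank by rank
        slot = rank // sockets     # next free core on that socket
        mapping[rank] = socket * cores_per_socket + slot
    return mapping


def ranks_per_socket(mapping: dict[int, int],
                     cores_per_socket: int = 4) -> Counter:
    """Count how many ranks landed on each socket."""
    return Counter(core // cores_per_socket for core in mapping.values())


if __name__ == "__main__":
    for n in (6, 8):
        mapping = pin_ranks(n)
        print(n, "ranks:", mapping, dict(ranks_per_socket(mapping)))
    # On Linux, a process could apply such a pin to itself with
    # os.sched_setaffinity(0, {core}); that step is left out here.
```

With 6 ranks the mapping puts 3 on each socket, and once each rank
stays on its own core there is nothing left for the kernel to
rebalance.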

Best regards,
Tiago Marques

> Is there anything else I ought to be doing as well?
> --
> Rahul
