[Beowulf] Maximizing intra-node communication performance

Wed Dec 28 20:11:57 PST 2005

Hi Tahir:

Tahir Malas wrote:
> Hi all,
> Taking advice from a previous discussion, we have purchased a Tyan server
> with 8 dual-core Opteron 870 processors. Now I am wondering how I can
> maximize the intra-node communication performance of the server. We have been using

By maximize, do you mean maximizing bandwidth?  Minimizing latency?  Both?
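
The distinction matters because small messages are dominated by per-message latency and large ones by bandwidth. A rough sketch of the usual linear cost model follows; the function names and the numbers are illustrative assumptions, not measurements of this or any other machine.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed startup latency plus size over bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def crossover_size(latency_s, bandwidth_bytes_per_s):
    """Message size at which the latency term equals the bandwidth term."""
    return latency_s * bandwidth_bytes_per_s


if __name__ == "__main__":
    # Illustrative values only (assumed, not measured).
    latency = 1e-6       # seconds per message
    bandwidth = 1e9      # bytes per second
    threshold = crossover_size(latency, bandwidth)
    for size in (8, 1024, 1024 * 1024):
        t = transfer_time(size, latency, bandwidth)
        bound = "latency" if size < threshold else "bandwidth"
        print(f"{size:>8} bytes: {t * 1e6:10.3f} us ({bound}-bound)")
```

If your code mostly sends messages well below the crossover size, latency is the number to chase; well above it, bandwidth.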

> LAM-MPI, but I think that the TCP/IP protocol may degrade performance.

In mpich 1.2.x using the ch_p4 device, I am not sure whether it will 
automatically use shared memory for MPI processes running on the same 
machine.  I suspect not.  I have used ch_shmem on such units with some 
success, though you have to start worrying about contention for shared 
memory arenas in a quad system when you are using a shared memory 
device.  Also, you need to make sure that memory and processes are 
pinned to the appropriate CPU (affinity scheduling using numactl and 
other bits).
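
As a minimal sketch of what pinning a process looks like from inside the program, here is a Python version using os.sched_setaffinity, which is Linux-only.  The helper name pin_to_cpu is made up for illustration.  Note that this pins the process only; it does not bind memory, which is what numactl's memory options are for.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU and return the new mask.

    Relies on os.sched_getaffinity/os.sched_setaffinity (Linux only).
    """
    allowed = os.sched_getaffinity(0)      # CPUs this process may currently use
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})         # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    first = min(os.sched_getaffinity(0))   # pick a CPU we are allowed to run on
    print("pinned to", sorted(pin_to_cpu(first)))
```

From the shell, something along the lines of "numactl --cpunodebind=0 --membind=0 ./app" binds both the CPUs and the memory to node 0; check the numactl man page on your system for the exact options your version supports.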

> Has
> anybody tried new implementations of MPI, or does anybody know of some other
> support for intra-node communication?

With mpich 1.2.x you could use ch_shmem.  I have run into some 
performance issues with this in the recent past, where an 8-way run on a 
dual-core quad unit using mpich and the ch_shmem device was not as fast 
as similar runs using other MPI stacks (mpich-ib, mpich-gm).  I have done 
some very recent work with the MPI and compiler bits from Pathscale for the 
LAMMPS code (molecular dynamics), which have shown excellent scalability 
per node and across nodes.

To date I have not managed to get LAMMPS to run with LAM.  LAM 
7.x offers (IMO) some nice features/functionality relative to mpich 1.2.x.

The issues in running on large NUMA systems are significant.  For large 
shared memory units with lots of memory controllers, you need to worry 
about first touch allocations (usually more so with OpenMP).  You really 
don't want lots of other things to get in the way of your performance, 
so time spent traversing a network stack is to be avoided.  A good MPI 
implementation is in order.
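
To illustrate first touch: under Linux's default local allocation policy, a page is physically placed on the NUMA node of the CPU that first writes to it, not the one that called the allocator.  The rough sketch below (the helper name first_touch is made up for illustration) writes one byte per page of an anonymous mapping.  The idea is that the worker that will use a buffer should be the one to touch it, once it is running on its intended CPU.

```python
import mmap


def first_touch(buf, page_size=mmap.PAGESIZE):
    """Write one byte to every page of buf so the kernel backs it with memory.

    Returns the number of pages touched.
    """
    pages = 0
    for offset in range(0, len(buf), page_size):
        buf[offset] = 0        # first write to a page triggers its allocation
        pages += 1
    return pages


if __name__ == "__main__":
    size = 64 * mmap.PAGESIZE
    buf = mmap.mmap(-1, size)  # anonymous mapping; pages not yet resident
    print("touched", first_touch(buf), "pages")
    buf.close()
```

If one process or thread initializes all of the data up front, every page can end up on a single node and the other sockets pay remote-access costs for the rest of the run.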

If you will only run on individual nodes and never across nodes, OpenMP 
can be quite powerful.  Mixed model (MPI across nodes, OpenMP on each 
node) is somewhat harder to do.

Joe

> Thanks in advance,
> Tahir Malas
> Bilkent University 
> Electrical and Electronics Engineering Department
> Phone: +90 312 290 1385 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615