[Beowulf] Parallel application performance tests

Wed Nov 29 06:04:05 PST 2006

Charlie

Not so. Jeff Squyres told me the same thing, but in my experience it is not
the TCP implementation in OpenMPI that is so bad; it is the algorithms for
the collectives that make the biggest difference. LAM is about the same as
OpenMPI for point-to-point. See below for a 2-node bidirectional edge
exchange. Results are 1-way throughput in Mbytes/sec. There is a difference
at 64 KBytes because OMPI is switching to the rendezvous protocol while LAM
was set to switch at 128K. Other than that the performance is similar.

MPICH has a worse TCP implementation than either of these, but its
collective algorithms are the best I have tested, particularly All_reduce,
which gets used a lot. So in some applications MPICH can edge out OpenMPI or
LAM for large numbers of CPUs.

The optimum All_reduce has an asymptotic time proportional to 2M, where M is
the message length. LAM and OMPI typically use a binary tree, which scales
as 2M*log_2(Np). This makes a substantial difference for large numbers of
processors (Np > 32). MPICH has a near-optimum algorithm that scales
asymptotically as 4M. So MPICH + GAMMA lays waste to any TCP implementation
including LAM. I have a lot of results for LAM as well, but not quite as
complete as for OpenMPI. In general I found OpenMPI v1.2 gave similar
application benchmarks to LAM, which is why I didn't bother to report them;
MPI/GAMMA is much faster than either of them.
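
To make the scaling argument concrete, here is a minimal sketch that simply
evaluates the three asymptotic expressions quoted above (2M, 4M, and
2M*log_2(Np)). The function name and the algorithm labels are made up for
illustration; the model ignores latency and all implementation constants, so
it shows only relative scaling, not measured times.

```python
import math


def allreduce_cost(m, nprocs, algorithm):
    """Asymptotic All_reduce cost for message length m, in arbitrary units.

    Only the scalings quoted in the text are modelled; the labels
    "optimal", "mpich" and "binary_tree" are illustrative names.
    """
    if algorithm == "optimal":
        return 2.0 * m
    if algorithm == "mpich":
        return 4.0 * m
    if algorithm == "binary_tree":
        return 2.0 * m * math.log2(nprocs)
    raise ValueError(f"unknown algorithm: {algorithm}")


if __name__ == "__main__":
    m = 1.0  # unit message length, so costs read as multiples of M
    print("Np  optimal  mpich  binary_tree")
    for nprocs in (4, 8, 32, 96):
        costs = [allreduce_cost(m, nprocs, a)
                 for a in ("optimal", "mpich", "binary_tree")]
        print(f"{nprocs:<3d} {costs[0]:7.1f} {costs[1]:6.1f} {costs[2]:12.2f}")
```

In this model the binary tree matches the 4M algorithm at Np = 4 and costs
10M at Np = 32, i.e. 2.5 times the 4M figure, and the gap keeps growing with
log_2(Np).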

OpenMPI had horrible collectives in v1.0 and v1.1. I got dreadful All_reduce
performance with TCP + OMPI (throughputs less than 0.1 Mbytes/sec). v1.2 is
much better than v1.1 but still poor in comparison to MPICH. The OpenMPI
developers chose to make the optimization of the collectives very flexible,
but there is no decent interface for handling the optimization yet. Also the
best algorithms (for instance for All_reduce) are not yet implemented as far
as I can tell. My attempts at tuning OpenMPI collectives were not very
successful.

Bottom line is that OpenMPI has improved the collectives significantly in
v1.2, so I no longer see significant differences between OMPI benchmarks and
LAM benchmarks. But MPI/GAMMA is much better than any TCP implementation,
both for network benchmarks and for applications.

Tony

Size (KBytes)	LAM (Mbytes/sec)	OMPI (Mbytes/sec)
1	8.4	8.2
2	15.0	15.3
4	21.6	21.8
8	36.0	34.7
16	54.4	53.7
32	74.9	73.0
64	90.8	45.0
128	51.5	51.5
256	55.9	55.7
512	58.3	62.0
1024	61.0	61.0
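
As a quick check of the "similar except at 64 KBytes" reading of the table,
the sketch below computes the relative gap between the LAM and OMPI columns
for each row. The numbers are copied from the table; the sizes are taken to
be KBytes (an assumption based on the 64KBytes remark in the text), and the
helper name relative_gap is made up for illustration.

```python
# (size, LAM, OMPI): throughputs in Mbytes/sec, copied from the table.
# Sizes are assumed to be in KBytes.
results = [
    (1, 8.4, 8.2),
    (2, 15.0, 15.3),
    (4, 21.6, 21.8),
    (8, 36.0, 34.7),
    (16, 54.4, 53.7),
    (32, 74.9, 73.0),
    (64, 90.8, 45.0),
    (128, 51.5, 51.5),
    (256, 55.9, 55.7),
    (512, 58.3, 62.0),
    (1024, 61.0, 61.0),
]


def relative_gap(lam, ompi):
    """Absolute difference as a fraction of the larger throughput."""
    return abs(lam - ompi) / max(lam, ompi)


if __name__ == "__main__":
    for size, lam, ompi in results:
        print(f"{size:5d}  {100.0 * relative_gap(lam, ompi):5.1f}%")
```

Apart from the 64 row, where the two libraries are using different
protocols, every gap stays below 7%.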

-----Original Message-----
From: Charlie Peck [mailto:charliep at cs.earlham.edu] 
Sent: Wednesday, November 29, 2006 7:31 AM
To: Tony Ladd
Subject: Re: [Beowulf] Parallel application performance tests

On Nov 28, 2006, at 1:27 PM, Tony Ladd wrote:

> I have recently completed a number of performance tests on a Beowulf
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
> Networks Gigabit edge switch. The tests consist of single and multi-node
> application benchmarks, including DLPOLY, GROMACS, and VASP, as well as
> specific tests of network cards and switches. I used TCP sockets with
> OpenMPI v1.2 and MPI/GAMMA over Gigabit ethernet. MPI/GAMMA leads to
> significantly better scaling than OpenMPI/TCP in both network tests and in
> application benchmarks.

It turns out that the TCP binding for OpenMPI is known to have
problems. They have been focusing on the proprietary high-speed
interconnects and haven't had time to go back and improve the
performance of the TCP binding yet. If you run LAM/TCP you will notice a
significant difference by comparison.

charlie