[Beowulf] Parallel application performance tests

Douglas Eadline deadline at clustermonkey.net
Wed Nov 29 06:29:47 PST 2006


Tony,

Interesting work to say the least. A few comments. The TCP
implementation of OpenMPI is known to be sub-optimal (i.e.
it can perform poorly in some situations). Indeed, using LAM
over TCP usually provides much better numbers.

I have found that the single-socket Pentium D
(now called the Xeon 3000 series) provides great performance.
The big caches help quite a bit, plus it is a single
socket (more sockets mean more memory contention).

That said, I believe that for the right applications GigE
can be very cost-effective. The TCP latency for
the Intel NICs is actually quite good (~28 us)
when the driver options are set properly, and GAMMA
takes it to the next level.
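
If you want to see what your own nodes are getting, a minimal
ping-pong like the sketch below is enough. This is just an
illustration (my own sketch, not the exact test used in the report);
half the average round-trip time for a small message approximates the
one-way latency that the ~28 us figure refers to. Build with mpicc
and run two ranks on two different nodes.

/*
 * Minimal MPI ping-pong latency sketch (illustration only).
 * Rank 0 and rank 1 bounce a small message back and forth; half the
 * average round-trip time approximates the one-way latency.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char buf[8] = {0};               /* small message: measures latency */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("one-way latency ~ %.1f us\n",
               (t1 - t0) / iters / 2.0 * 1.0e6);

    MPI_Finalize();
    return 0;
}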

I have not had time to read your report in its entirety,
but I noticed your question about how GigE+GAMMA can do as
well as Infiniband. Well, if the application does not
need the extra throughput, then there will be no improvement.
It is the same reason the EP test in the NAS parallel suite is
about the same for every interconnect (EP stands for
Embarrassingly Parallel), whereas IS (Integer Sort)
is very sensitive to latency.
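
To make that concrete, here is a toy sketch (my own illustration, not
the actual NAS kernels): the first phase is dominated by local compute
and touches the network only once, so any interconnect gives about the
same time, while the second phase is a stream of tiny collectives and
effectively measures per-message latency.

/*
 * Toy contrast between the two extremes (illustration only).
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* "EP-like" phase: all work is local, one collective at the end,
     * so the interconnect hardly shows up in the run time. */
    double local = 0.0, global;
    for (long i = 0; i < 50000000L; i++)
        local += (double)i * 1.0e-12;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* "IS-like" phase: thousands of small messages; run time is roughly
     * (number of steps) x (per-message latency). */
    int val = rank, sum;
    double t0 = MPI_Wtime();
    for (int r = 0; r < 10000; r++)
        MPI_Allreduce(&val, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("10000 small collectives took %.3f s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}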

Now, with multi-socket/multi-core becoming the norm,
better throughput will become more important. I'll have
some tests posted before too long to show the difference
on dual-socket quad-core systems.

Finally, OpenMPI+GAMMA would be really nice. The good news
is that OpenMPI is very modular.

Keep up the good work.

  --
  Doug



> I have recently completed a number of performance tests on a Beowulf
> cluster, using up to 48 dual-core P4D nodes, connected by an Extreme
> Networks Gigabit edge switch. The tests consist of single- and multi-node
> application benchmarks, including DLPOLY, GROMACS, and VASP, as well as
> specific tests of network cards and switches. I used TCP sockets with
> OpenMPI v1.2 and MPI/GAMMA over Gigabit Ethernet. MPI/GAMMA leads to
> significantly better scaling than OpenMPI/TCP in both network tests and
> in application benchmarks. The overall performance of the MPI/GAMMA
> cluster on a per-CPU basis was found to be comparable to a dual-core
> Opteron cluster with an Infiniband interconnect. The DLPOLY benchmark
> showed similar scaling to that reported for an IBM p690. The performance
> using TCP was typically a factor of 2 lower in these same tests. Here are
> a couple of examples from DLPOLY benchmark 1 (27,000 NaCl ions):
>
> CPUs   OpenMPI/TCP (P4D)   MPI/GAMMA (P4D)   OpenMPI/Infiniband (Opteron 275)
>
>   1         1255                1276               1095
>   2          614                 635                773
>   4          337                 328                411
>   8          184                 173                158
>  16          125                  95                 84
>  32           82                  56                 50
>  64           84                  34                 42
>
> A detailed write up can be found at:
> http://ladd.che.ufl.edu/research/beoclus/beoclus.htm
>
>
>
> Tony Ladd
> Chemical Engineering
> University of Florida
>
> -------------------------------
> Tony Ladd
> Chemical Engineering
> University of Florida
> PO Box 116005
> Gainesville, FL 32611-6005
>
> Tel: 352-392-6509
> FAX: 352-392-9513
> Email: tladd at che.ufl.edu
> Web: http://ladd.che.ufl.edu




