Beowulf & Fluid Mechanics

Josip Loncaric josip at icase.edu
Mon Jul 17 13:05:43 PDT 2000


Greg Lindahl wrote:
> 
> > To me, the most interesting conclusions based on Brian's tests concern
> > MPI implementations.
> 
> However, the MVICH result is more odd. MVICH doesn't have to go through the
> kernel to talk to Giganet, right? And as a point of comparison, Myrinet on a
> dual Alpha has an SMP penalty of as little as 3% on some of my real
> applications (the FSL weather codes). Myrinet and Giganet are pretty much
> the same, from the programmer's point of view.

The MVICH we had was version 0.03, based on MPICH 1.1.2.  Thanks to VIA,
MVICH avoids a buffer copy but it still needs to talk to the kernel at
times (I think).  Comparing MVICH to MPI/Pro suggests that the method of
accessing hardware matters a great deal (at least on Intel platforms).
Even with Fast Ethernet, both LAM and MPICH impose about a 25% performance
penalty on SMP nodes.  With faster networks (e.g. Giganet) this penalty
grows to about 40% (MVICH).  In both cases, MPI/Pro has virtually *no*
SMP performance penalty, but its latency figures are poor compared
to LAM/MPICH/MVICH.  This is very odd, and it suggests that while
polling produces low latency, it does so at the expense of a
significant portion of the CPU cycles.
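
To make the distinction concrete, here is a minimal sketch (my own
illustration, not the actual MVICH or MPI/Pro internals) of the two
progress strategies: a polling wait spins on the interconnect until the
message arrives, while a yielding wait hands the CPU back to whatever
else is running on the node:

#include <mpi.h>
#include <sched.h>

/* Low latency, but burns a whole CPU (and memory/PCI bandwidth)
   for as long as the wait lasts. */
static void wait_polling(MPI_Request *req)
{
    int done = 0;
    MPI_Status st;
    while (!done)
        MPI_Test(req, &done, &st);   /* tight loop */
}

/* Slightly higher latency, but the CPU goes back to whoever
   else is runnable on the node. */
static void wait_yielding(MPI_Request *req)
{
    int done = 0;
    MPI_Status st;
    while (!done) {
        MPI_Test(req, &done, &st);
        if (!done)
            sched_yield();           /* let another process run */
    }
}

On a dual node with two MPI processes, the polling variant also keeps
hammering the shared memory and PCI subsystems, which the process on
the other CPU has to compete with; that would be consistent with the
SMP penalties above.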

> Obviously I should get a copy of Brian's test before I make the bald-faced
> claim that I'm about to make, but: perhaps the SMP penalty you're seeing
> from MVICH comes from the fact that it's beating on main memory or the PCI
> bus in a tight loop. If it instead did a little in-processor busy loop to
> not poll more than once every few microseconds, the main memory or PCI
> traffic would be significantly lessened, but the latency wouldn't change
> much.

You may be right, but reducing the memory or PCI traffic may not make
a big enough difference by itself.  Some other bottleneck may be
involved (e.g. a single-threaded portion of the kernel).
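
For reference, your suggestion amounts to something like the sketch
below (an illustration only; DELAY_ITERS is a made-up tuning knob, not
a real MVICH parameter).  The delay loop stays entirely in-processor,
so the message-status check in main memory or across the PCI bus only
happens every few microseconds:

#include <mpi.h>

#define DELAY_ITERS 2000   /* calibrate to a few microseconds on the target CPU */

static void wait_rate_limited(MPI_Request *req)
{
    int done = 0;
    MPI_Status st;
    while (!done) {
        MPI_Test(req, &done, &st);   /* roughly one memory/PCI touch per pass */
        if (!done) {
            volatile int i;
            for (i = 0; i < DELAY_ITERS; i++)
                ;                    /* spin in registers/cache only */
        }
    }
}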

> SMP effects are an extremely interesting swamp to dive into. I know that
> Compaq SC benchmarks (4-proc Alphas with Quadrics -- but only 2.5 cpus worth
> of main memory bandwidth) can show some *really* interesting performance
> losses for multiple CPUs. I have tried to avoid multi-processor machines for
> this reason, but the 2nd Intel cpu is so cheap that it's hard to dodge using
> them.

Yup.  However, on our machines the price advantage of SMP is only 11%
while the performance penalty is at least 25% (using LAM and Fast
Ethernet).  This would favor uniprocessor nodes, which are easier to
administer and more robust anyway, but unfortunately we do not have
enough space in our computer room for so many boxes.  We are buying more
SMP nodes because they pack more CPUs into the same space...
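(A rough back-of-envelope, reading the 25% as a per-CPU throughput
loss: two uniprocessor nodes cost roughly 11% more than one dual node
but deliver about 2.0 CPUs of useful work versus 2 x 0.75 = 1.5, so
per dollar the uniprocessors still come out roughly 20% ahead.  Floor
space, not price/performance, is what tips the decision.)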

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134



