[Beowulf] multi-threading vs. MPI

Mon Dec 10 16:40:15 PST 2007

Eray Ozkural wrote:
> On Dec 9, 2007 9:37 PM, Toon Knapen <toon.knapen at gmail.com> wrote:
> 
>> And considering that future processors are going even more extreme in
>> the NUMA direction (e.g. the Intel 80-core), isn't it more future-safe
>> to go with MPI if one would start a large coding-project now?
>>
>> thanks for all the reactions,
> 
> I think that's a good point. For NUMA obviously MPI is more useful.

I have been staying out of the debate thus far, as I believe that it is 
more likely to generate heat than light.

A few obvious points:

a) single benchmarks do not a definitive statement make
b) the only code that matters is your code (really, this should be 
everyone's mantra with benchmarking in general).

12 years ago, after starting work at SGI, I had to work hard to convince
people that a 75 MHz R8000 chip could actually be faster (e.g. lower
wall clock time on a real app with real data) than a 233 MHz (or whatever
it was) Alpha chip.  It was "obvious" to most people that Alpha was
faster.  That is, it was obvious from the cpu clock, various "standard"
benchmark cases, and so forth ... until they ran their own codes, and
saw some rather different results.

The point of this is that I see the same thing playing out here, with 
people's opinions and notes generating the heat.  I would prefer to try 
to shed a little light if possible, and keep the heat level as low as 
possible.

FWIW:  I have been using OpenMP for something like 11 years (pretty much 
since inception), and MPI since about 1997.  I have used both in 
projects with customers, end users, collaborators.  I have taught 
graduate level courses in HPC programming using both.

Generally speaking, I find that scientists/engineers "get" OpenMP more
easily than MPI.  They have to work less hard to get some benefit from
OpenMP than from MPI.

I expect the above statement to generate a great deal of heat, which is
a shame, as the next statement should generate a great deal of light.

That said, since OpenMP does stuff for you, you have to think and work
harder to prevent the performance-killing conditions which can and often
do show up in real code.  OpenMP lets you share data, and as you
increase the number of CPUs sharing the data, the shared data often
becomes the bottleneck.  Then again, with some careful re-crafting of
the code (not a complete rewrite), it is entirely possible to mitigate
many of the issues.  That is, OpenMP saves you from thinking hard to get
some benefit, but you need to think hard to get good benefit on larger
systems.  More about this in a minute.
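As a rough illustration of the kind of re-crafting meant here, the
sketch below shows the same dot product written two ways.  The function
names (dot_shared, dot_reduction) are made up for this example and are
not from any particular code.  The first version has every thread
update one shared accumulator; the second lets each thread accumulate
privately and combines the partial sums once at the end.  Which one is
faster on a given machine still has to be measured on your own code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: every thread updates one shared accumulator.
 * The result is correct, but the atomic update makes all threads
 * contend for the single variable `sum`. */
double dot_shared(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++) {
        #pragma omp atomic
        sum += x[i] * y[i];
    }
    return sum;
}

/* Re-crafted version: the reduction clause gives each thread a private
 * copy of `sum`; the copies are added together once after the loop. */
double dot_reduction(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Both functions compile with or without OpenMP enabled (e.g. with or
without -fopenmp on gcc); without it the pragmas are ignored and the
loops run serially.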

MPI is harder (though some may disagree).  You have to rewrite and 
rethink your code.  While this is harder, this is also a good thing.  It 
forces you to explicitly consider data locality issues (NUMA is an 
example of a data locality hierarchy) which OpenMP does not explicitly 
force you to consider.  It forces you to avoid global data, and all the 
pain that goes with it (false sharing, atomic updates, ...).  It forces 
you to explicitly move data.
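To make the "who owns which data" bookkeeping concrete, here is a
minimal sketch of a block decomposition.  It makes no MPI calls:
block_range and block_t are hypothetical names, and in a real MPI code
the rank and nranks arguments would come from MPI_Comm_rank and
MPI_Comm_size.  Each rank would then allocate and work on only its own
slice, and anything it needs from a neighbour has to be sent and
received explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of global indices owned by one rank. */
typedef struct {
    size_t begin;
    size_t end;
} block_t;

/* Split n items over nranks as evenly as possible: the first
 * (n % nranks) ranks each get one extra item. */
block_t block_range(size_t n, int rank, int nranks)
{
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    size_t base  = n / (size_t)nranks;
    size_t extra = n % (size_t)nranks;
    size_t r     = (size_t)rank;
    block_t b;
    b.begin = r * base + (r < extra ? r : extra);
    b.end   = b.begin + base + (r < extra ? 1 : 0);
    return b;
}
```

For example, 10 items over 4 ranks gives the ranges [0,3), [3,6),
[6,8) and [8,10).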

Also, unlike OpenMP, the communication model can be easily matched to
the underlying problem, which tends to mean a tighter coupling of the
computing resource to the algorithm.  OpenMP is a bag-o-threads, and you
don't have an "explicit" communication pattern between threads.

I don't consider one "better" than the other for all problems.  For 
certain classes of problem, OpenMP is the logical and obvious choice, 
while MPI is the logical and obvious choice for other classes.  Aside 
from this, without channeling an ex-US president, we need to define what 
"better" means.  Faster execution on model problems?  Faster 
benchmarking?  Faster development, ease of code 
testing/debugging/management?

I do agree with Greg in that I have not to date seen a code where the 
hybrid model is better than the pure model.

Back to Eray's point.

For NUMA, you have a small set of data points which show that MPI 
provides superior performance on a code.  The question is whether or not 
the OpenMP code used first-touch or similar allocation ... without more 
information, it is fairly hard to draw conclusions, never mind general 
conclusions.  Large SGI machines have gobs of NUMA shared memory, and 
you can get very good scalability with (non-trivial) OpenMP codes.
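For readers who have not met first-touch allocation, the sketch below
shows the idea.  It assumes an OS that places a memory page on the NUMA
node of the thread that first writes to it (the default policy on
Linux); alloc_first_touch and triad are hypothetical names for this
example.  The arrays are initialized in parallel with the same static
schedule that the compute loop uses later, so each thread mostly works
on pages that live in its local memory.  The kernel is in the style of
the STREAM triad.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate n doubles and initialize them in parallel.  Under a
 * first-touch policy (an assumption about the OS), each page ends up
 * on the NUMA node of the thread that writes it here. */
double *alloc_first_touch(size_t n, double value)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = value;
    return a;
}

/* Triad-style kernel, a[i] = b[i] + s * c[i].  It uses the same static
 * schedule as the initialization, so each thread gets the same index
 * range it touched first. */
void triad(double *a, const double *b, const double *c, double s,
           size_t n)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        a[i] = b[i] + s * c[i];
}
```

An OpenMP code that instead initializes its arrays in a serial loop can
end up with all pages on one node, which is one way a NUMA comparison
against MPI can come out looking worse than it needs to.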

What we see going forward are desktops with 4-16 cores (biased as this 
is what we are doing/selling) and a shared memory system.  NUMA for AMD, 
flat (non-NUMA) for Intel.  Intel is going to NUMA as far as I have seen 
at SC07 and elsewhere (and Intel folks, please do step in and let me 
know if I am wrong).  A well-written OpenMP code that knows how to use
memory correctly should be able to exploit these multiple memory buses
without too many issues.  The STREAM code is an example of a "trivial"
(sorry John) code which runs very nicely under OpenMP.

There are others.  A fair number of commercial codes with large solvers
don't do decomposition very well, and tend either not to have great MPI
versions or to have not-so-great MPI scalability.  They do shared memory
quite nicely, and will scale well on large processor count machines with
lots of memory buses (MSC/NASTRAN, various other similar codes, ...).

What I am belaboring here is that it is *not* obvious at all that one
or the other method is "better" in a general sense, since "better" is
not well defined to begin with in this context.

Our view has always been: use what you are comfortable with, and what
you need.  If you need to run across a cluster, use MPI.  If you need to
run across a single large memory machine, use OpenMP.

FWIW:  I would suggest learning both.  With the advent of many-core
workstations, and accelerator systems with many, many cores, programming
these things is more likely to be mediated by a compiler (OpenMP-like)
than by putting MPI stacks on the Cell SPUs (there is not enough local
scratchpad RAM for that).

Just my $0.02, and I hope I generated light, and very little heat.

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615