[Beowulf] RE: programming multicore clusters

Fri Jun 15 05:46:49 PDT 2007

> Is running a program using OpenMP on a SMP/multi-core box more efficient than
> an MPI code with an implementation using localhost optimization?

beyond 2-4p, all machines are message passing.

take a look at Intel's recent products: they have products with 
one or two dual-core chips in a package, but if you want dual
sockets, you get two FSBs - partly for fanout/loading reasons,
and partly because truly symmetric, flat SMP machines just don't 
scale.

OK, so once you accept that even shared-memory machines are actually
passing messages, the question becomes: what kind of protocol and 
message size do you want?  on a typical message-passing SMP machine
(multi-socket x86_64, even SGI Altix), the message size is a cache line
(64 or 128B afaik).  that's a pretty OK number, but to make effective use 
of it, you have to write your code so that you pack as much relevant
data as possible into these appropriately aligned and sized chunks of
memory, knowing that they'll implicitly become packets.  you have to
marshal your packets, if you will.  gosh!  same term is used in explicit
msg-passing...
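
to make the "cache line as packet" idea concrete, here is a rough C11
sketch.  the struct name, its fields and the 64B line size are all my
assumptions for illustration (some machines use 128B); the point is only
that everything a neighbor reads per timestep sits in one aligned line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines; some machines use 128 */

/* Hypothetical per-cell state a neighbor reads every timestep.
 * All of it fits in one aligned line, so a single coherence
 * transfer (the implicit "packet") carries the whole update. */
struct cell_packet {
    alignas(CACHE_LINE) double value[6]; /* 48 bytes of payload          */
    uint64_t step;                       /*  8 bytes: timestep stamp     */
    uint32_t owner;                      /*  4 bytes: producing worker   */
    uint32_t flags;                      /*  4 bytes: boundary markers   */
};

/* Fail the build if the layout ever stops being exactly one line. */
static_assert(sizeof(struct cell_packet) == CACHE_LINE,
              "cell_packet must fill exactly one cache line");
static_assert(alignof(struct cell_packet) == CACHE_LINE,
              "cell_packet must start on a cache-line boundary");

/* Number of cache lines a byte range [offset, offset+len) touches.
 * A misaligned packet costs two "messages" instead of one. */
size_t lines_touched(size_t offset, size_t len)
{
    if (len == 0)
        return 0;
    size_t first = offset / CACHE_LINE;
    size_t last  = (offset + len - 1) / CACHE_LINE;
    return last - first + 1;
}
```

lines_touched(0, 64) is 1, while the same 64 bytes starting at offset 32
straddle two lines.  note that plain malloc does not promise 64B
alignment; for heap-allocated packets something like C11 aligned_alloc
would be needed.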

in other words, you have to adopt a message-passing methodology
regardless of whether your packets are fixed-sized implicit things,
or variable-sized, explicit ones.  the main difference is in how 
your messages are addressed - by a simple flat memory address, or 
by something typically like <node,port,tag>.

in some cases, implicit, memory-based addressing is a real win - 
mainly if many of your remote one-sided references are to a space 
that can remain unsynchronized for an extended time (say per timestep).
I don't think I've ever seen a paper that tried to quantify this 
directly, though it would be most interesting...

ccNUMA - provides automatic synchrony by tracking the state of each 
cache line.  but limited by cache size, and perhaps this tracking 
is irrelevant given your access patterns.  the level of consistency
may also hurt you, since a naive programmer will waste major cpu time
on false sharing or hot cache lines.
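
a small C11 sketch of the false-sharing trap, again assuming 64B lines;
the struct and function names are made up for illustration.  per-worker
counters packed back to back share a line, so every update by one worker
invalidates its neighbors' copies; padding each counter out to a full
line avoids that at the cost of memory.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines */

/* Naive layout: in an array, adjacent workers' counters are 8 bytes
 * apart, so up to eight of them land in the same cache line. */
struct naive_counter {
    int64_t hits;
};

/* Padded layout: alignment forces each counter onto its own line. */
struct padded_counter {
    alignas(CACHE_LINE) int64_t hits;
};

static_assert(sizeof(struct padded_counter) == CACHE_LINE,
              "padded_counter should occupy a whole cache line");

/* True if the two addresses fall within the same cache line. */
bool same_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

with a line-aligned array of naive_counter, same_line(&c[0], &c[1]) is
true; with padded_counter it is false, so two workers updating their own
slots no longer fight over one line.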

RDMA - similar to ccNUMA except with no 'O' or 'E' states, or tracking
of states at all.  no hardware-supported consistency guarantees, but 
also significantly higher latency.

explicit msg-passing - different addressing, explicit list of data,
not purely what's in a cacheline, but also explicit synchronization,
which may seem too rigid.  latency not that much higher than RDMA.

for the classic example of one worker wanting to collect state
from its grid neighbors, direct memory access seems the most natural.
but MPI codes can handle this pretty successfully by either using 
a nonblocking irecv or by having a data-serving thread.  either one
is, admittedly, extra overhead.

unless most of your IPC is this kind of async, unsync, passive data
reference, I wouldn't think twice: go MPI.  the current media frenzy
about multicore systems (nothing new!) doesn't change the picture much.

regards, mark hahn.