[Beowulf] Cluster OpenMP
Mark Hahn
hahn at physics.mcmaster.ca
Tue May 16 08:17:08 PDT 2006
> http://softwareforums.intel.com/ids/board/message?board.id=11&message.id=3793
>
> Maybe it will be interesting for many people here.
I'm not so sure. DSM has been done for decades now, though I haven't
heard of OpenMP-ish implementations based on it. Fundamentally,
there's no real conceptual challenge to implementing SHM across nets,
just practical matters. Regardless of whether you present a SHM
interface to the programmer, you eventually have to adopt the same
basic programming models that reflect the topology of your interconnect.
For instance, scalable OpenMP codes seem to migrate towards looking
a bit like message-passing codes: if you care about latency, you want
to batch together the relevant data, and scalability really does mean
caring about latency (and often bandwidth).
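Here is a toy sketch (mine, not from any Cluster OpenMP material) of what
that message-passing-flavoured OpenMP style looks like: each thread works
on one contiguous block of the shared array and copies the boundary values
it needs up front, rather than reaching across block boundaries element by
element inside the hot loop.

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N 1024

  int main(void)
  {
      double *a = malloc(N * sizeof *a);
      double *b = malloc(N * sizeof *b);
      for (int i = 0; i < N; i++) a[i] = i;

      #pragma omp parallel
      {
          int nt  = omp_get_num_threads();
          int tid = omp_get_thread_num();
          int lo  = tid * N / nt;          /* this thread's contiguous block */
          int hi  = (tid + 1) * N / nt;

          /* batch the neighbouring ("halo") values this block depends on */
          double left  = (lo > 0) ? a[lo - 1] : 0.0;
          double right = (hi < N) ? a[hi]     : 0.0;

          for (int i = lo; i < hi; i++) {
              double l = (i == lo)     ? left  : a[i - 1];
              double r = (i == hi - 1) ? right : a[i + 1];
              b[i] = 0.5 * (l + r);        /* simple 1-D averaging stencil */
          }
      }
      printf("b[1] = %g\n", b[1]);
      free(a); free(b);
      return 0;
  }

On a real DSM, the point is that the halo copies become the only remote
traffic, which is exactly the shape of an MPI halo exchange.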
Doing DSM based on pages is convenient, since it means you can put
your smarts into a library with a fairly straightforward kernel/net
interface. Innumerable master's-thesis projects have done this, as well
as Mosix and others. The downside is that to fetch a single byte,
you take a page fault and do some kind of RPC. But if your shared
data is read-mostly, or naturally very granular, you're golden.
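The page-fault trick itself is simple enough to sketch in a few lines.
The following is my own minimal illustration for Linux (nothing to do with
Intel's implementation): the "shared" region is mapped with no access
rights, the SIGSEGV handler stands in for the RPC to the page's home node,
and error handling and async-signal-safety are glossed over.

  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static char  *region;
  static size_t region_size;
  static long   page_size;

  static void fault_handler(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)ctx;
      char *page = (char *)((uintptr_t)si->si_addr &
                            ~(uintptr_t)(page_size - 1));

      /* a real DSM would RPC to the owning node here to fetch the page */
      mprotect(page, page_size, PROT_READ | PROT_WRITE);
      memset(page, 42, page_size);
  }

  int main(void)
  {
      page_size   = sysconf(_SC_PAGESIZE);
      region_size = 4 * (size_t)page_size;
      region = mmap(NULL, region_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sigemptyset(&sa.sa_mask);
      sa.sa_flags     = SA_SIGINFO;
      sa.sa_sigaction = fault_handler;
      sigaction(SIGSEGV, &sa, NULL);

      /* touching one byte costs a fault plus a whole-page "fetch" */
      printf("first byte: %d\n", region[123]);
      return 0;
  }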
Hooking into the language is a popular way to break up the chunks:
the compiler can simply emit get/put operations at language-appropriate
places. For apps hurt by page-based sharing, this is certainly better,
but writing a compiler, or even a preprocessor, is a pretty big deal.
There have been multiple implementations of this approach, but somehow
they never gain much traction.
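To make the idea concrete, here is a hypothetical illustration of what such
a translator might emit; dsm_get()/dsm_put() are made-up names for a runtime
that would talk to the owning node, not any real API. The point is that the
translator knows the element size and the access site, so it can move
exactly the bytes needed instead of faulting in whole pages.

  #include <stddef.h>
  #include <stdio.h>
  #include <string.h>

  /* toy stand-in for the DSM runtime: here it's just a local array */
  static double backing_store[1024];

  static void dsm_get(void *dst, size_t global_off, size_t len)
  {
      memcpy(dst, (char *)backing_store + global_off, len);
  }

  static void dsm_put(size_t global_off, const void *src, size_t len)
  {
      memcpy((char *)backing_store + global_off, src, len);
  }

  /* source:  shared double x[1024];  x[i] = x[j] + 1;
   * what the translator might emit for that statement: */
  void lowered_statement(size_t i, size_t j)
  {
      double tmp;
      dsm_get(&tmp, j * sizeof(double), sizeof(double));  /* read  x[j] */
      tmp += 1.0;
      dsm_put(i * sizeof(double), &tmp, sizeof(double));  /* write x[i] */
  }

  int main(void)
  {
      backing_store[3] = 41.0;
      lowered_statement(7, 3);
      printf("x[7] = %g\n", backing_store[7]);  /* prints 42 */
      return 0;
  }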
Doing an implementation that is really smart about handling sequential
vs. random access patterns (prefetching, etc.), doing all the right
locking and load-balancing, accepting programmer hints/assertions, and
so on, is a pretty big undertaking, and I don't know of a system that
has done it well. Also, there are sort-of-intermediate interfaces, such
as Global Arrays or Charm++.
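MPI's one-sided operations are another example of that intermediate
flavour (not Global Arrays itself, just the same idea): the data is
globally addressable through a window, but you move it with explicit
get/put in whatever chunk size you choose. A rough sketch, with minimal
error handling:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      /* each rank exposes 1024 doubles; together they form a "global" array */
      enum { CHUNK = 1024 };
      double local[CHUNK];
      for (int i = 0; i < CHUNK; i++) local[i] = rank;

      MPI_Win win;
      MPI_Win_create(local, CHUNK * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      /* rank 0 pulls a 16-element slab from the last rank: one explicit,
       * batched transfer rather than many implicit page faults */
      double slab[16];
      MPI_Win_fence(0, win);
      if (rank == 0)
          MPI_Get(slab, 16, MPI_DOUBLE, nprocs - 1, 0, 16, MPI_DOUBLE, win);
      MPI_Win_fence(0, win);

      if (rank == 0)
          printf("slab[0] fetched from rank %d = %g\n", nprocs - 1, slab[0]);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }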
And fundamentally, you have to notice that SHM approaches tend to yield
quite modest speedups. For instance, in the Intel whitepaper they're
showing speedups of 3-8 on a 16-way machine; that's really not very good.
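Just to spell out how modest that is: a speedup of 8 on 16 CPUs is 50%
parallel efficiency, and read through Amdahl's law it corresponds to a
serial fraction of roughly 7% (nearly 30% at the low end). This is pure
arithmetic, not anything measured:

  #include <stdio.h>

  int main(void)
  {
      const double N   = 16.0;            /* processors */
      const double S[] = { 3.0, 8.0 };    /* speedups quoted in the whitepaper */

      for (int i = 0; i < 2; i++) {
          double eff    = S[i] / N;                      /* parallel efficiency */
          double serial = (N / S[i] - 1.0) / (N - 1.0);  /* Amdahl serial fraction */
          printf("speedup %4.1f on %2.0f CPUs: efficiency %3.0f%%, "
                 "implied serial fraction %4.1f%%\n",
                 S[i], N, 100.0 * eff, 100.0 * serial);
      }
      return 0;
  }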
If you insist on a SHM interface, IMO you're best off sticking to fairly
small SMP machines as the mass market inches up. Right now, that's
probably 4-socket dual-core Opterons, but with quad-cores coming (and
some movement towards smarter system fabrics by AMD and Intel),
affordable SMP is likely to grow to ~32-way within a year or two.
IMO, anyone who needs "real" scaling (>64x real speedup, say) has
already bitten the MPI bullet. But I'm willing to be told I'm a
message-passing bigot ;)
regards, mark hahn.