[Beowulf] RE: programming multicore clusters

Joe Landman landman at scalableinformatics.com
Fri Jun 15 06:57:08 PDT 2007

Toon Knapen wrote:
> Mark Hahn wrote:
>> unless most of your IPC is this kind of async, unsync, passive data
>> reference, I wouldn't think twice: go MPI.  the current media frenzy
>> about multicore systems (nothing new!) doesn't change the picture much.
> Because of everybody going multi-core, everybody is pushing to go 
> multi-threading to exploit these architectures (e.g. the gaming-world 
> and many more). IIUC you're saying that MPI might better exploit these 
> architectures? Interesting POV!

Multicore has some interesting upsides.  The downsides, chiefly 
oversubscription of the memory bandwidth out of the sockets, remind me 
of the days of larger SMP boxes with big buses in the early/mid 90s.

First, shared memory is nice and simple as a programming model. 
Multicore suggests that shared memory should be very easy to exploit. 
But you still have to worry about contention, affinity, and everything 
else we used to have to worry about a decade ago with the big machines. 
The precious resource whose utilization you need to optimize is no 
longer CPU cycles, but memory bandwidth.

Second, MPI is a more complex model.  It forces you to reconsider how 
the algorithm is mapped to the hardware.  And it makes no assumptions 
about the hardware, at least in the API.  The implementation, though, 
can be taught about multi-core: optimizing communication within boxes 
via shm sockets, and between boxes by other methods.  I think a few of 
the MPI toolkits do this today (Scali, Intel, Open MPI, ...).
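
As a rough illustration of what a multi-core-aware implementation might 
do, here is a minimal Python sketch of a per-pair transport choice: 
ranks on the same host talk through shared memory, ranks on different 
hosts go over the network.  The names (Rank, choose_transport, the 
transport labels) are made up for this example and are not taken from 
any real MPI toolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank:
    """A hypothetical MPI process: its rank number and the host it runs on."""
    rank: int
    host: str


def choose_transport(a: Rank, b: Rank) -> str:
    """Pick a transport label for messages between two ranks."""
    if a.rank == b.rank:
        return "self"      # a rank sending to itself needs no transport
    if a.host == b.host:
        return "shm"       # same box: copy through shared memory
    return "network"       # different boxes: use the interconnect


if __name__ == "__main__":
    # Hypothetical layout: two ranks on node1, one rank on node2.
    ranks = [Rank(0, "node1"), Rank(1, "node1"), Rank(2, "node2")]
    for a in ranks:
        print(a.rank, [choose_transport(a, b) for b in ranks])
```

Note that both off-box and on-box paths still end up pulling data 
through the same memory pipes on each socket.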

Neither of these models takes into account the fact that memory 
bandwidth out of a socket is finite.  Technically this is an 
implementation issue, but as we hit larger and larger core counts, some 
codes (well, larger fractions of the parallel code base) are likely to 
run into this resource contention issue.

We were seeing contention for fabric interconnects (e.g. bus contention) 
with LAMMPS runs for a customer last year, simply in going from single 
to dual core.  It was significant enough that the customer opted for 
single core.  This contention is not going to get better as you increase 
the number of cores.  Since MPI does, in part, depend upon a contended 
resource (the interconnect), it is not at all clear to me that MPI will 
be the *best* choice for programming all the cores, though it certainly 
would be a simple choice.

Greg is right when he notes that the hybrid model is a challenge. 
Unfortunately we appear to be facing a regime with multiple layers of 
hierarchy, so this will need resolution.  You can create a globally 
"optimal" code via MPI that may not be as efficient locally as you would 
like, and will likely grow less so with more cores, or a locally optimal 
never-get-out-of-the-box code via shared memory.
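
To make the layering concrete, the sketch below enumerates the ways one 
node's cores could be split between MPI ranks and threads per rank.  The 
function name and the 8-core example are assumptions for illustration 
only; which split works best depends on the code and the machine.

```python
def hybrid_layouts(cores_per_node: int) -> list:
    """Return every (ranks_per_node, threads_per_rank) pair that uses all cores."""
    if cores_per_node < 1:
        raise ValueError("cores_per_node must be at least 1")
    return [
        (ranks, cores_per_node // ranks)
        for ranks in range(1, cores_per_node + 1)
        if cores_per_node % ranks == 0
    ]


if __name__ == "__main__":
    # Hypothetical 8-core node: from pure shared memory (1 rank x 8 threads)
    # to pure MPI (8 ranks x 1 thread).
    for ranks, threads in hybrid_layouts(8):
        print(f"{ranks} rank(s) x {threads} thread(s)")
```

The two ends of that list are the two extremes described above; 
everything in between is a hybrid.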

Shared memory scales nicely on NUMA machines, assuming 1-2 cores per 
memory controller.  It doesn't scale with 8 cores and one memory bus. 
How well does STREAM run on Clovertown?  The NAS Parallel Benchmarks?
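
A back-of-the-envelope way to see this is a toy model, sketched here in 
Python.  It assumes a purely memory-bound code in which every core 
wants a fixed bandwidth, and it ignores NUMA locality, caches, and 
latency.  The function name and the GB/s figures are invented for the 
example; they are not measurements of any real machine.

```python
def bandwidth_bound_speedup(cores: int, controllers: int,
                            bw_per_controller: float,
                            demand_per_core: float) -> float:
    """Speedup over one core when cores share `controllers` memory controllers.

    Assumes a purely memory-bound code: each core wants `demand_per_core`
    GB/s, and the controllers together supply at most
    controllers * bw_per_controller GB/s.
    """
    supply = controllers * bw_per_controller
    single_core_rate = min(demand_per_core, supply)
    all_cores_rate = min(cores * demand_per_core, supply)
    return all_cores_rate / single_core_rate


if __name__ == "__main__":
    # Made-up figures: 10 GB/s per controller, 4 GB/s wanted by each core.
    print(bandwidth_bound_speedup(8, 1, 10.0, 4.0))  # 8 cores, one bus: 2.5
    print(bandwidth_bound_speedup(8, 4, 10.0, 4.0))  # 2 cores per controller: 8.0
```

With these invented numbers, eight cores behind one bus top out at 2.5x, 
while two cores per controller still reach 8x.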

The issue, at the end of the day, is the contended-for resources.


Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
