[Beowulf] programming multicore clusters

Greg Lindahl lindahl at pbm.com
Wed Jun 13 22:55:39 PDT 2007

On Wed, Jun 13, 2007 at 07:29:29AM -0700, Joseph Mack NA3T wrote:

> "Most of the folks interested in hybrid models a few years 
> ago have now given it up".
> I assume this was from the era of 2-way SMP nodes.

No, the main place you saw that style was on IBM SPs with
8+ cores/node.

> I expect the programming model will be a little different
> for single image machines like the Altix, than for beowulfs
> where each node has its own kernel (and which I assume will
> be running dual quadcore mobos).

Most Altixes spend most of their time running MPI programs.
Or at least that was certainly the case with Origin.

> Still if a flat, one network model is used, all processes 
> communicate through the off-board networking.

No, the typical MPI implementation does not use off-board networking
for messages to local ranks. You use the same MPI calls, but the
underlying implementation uses shared memory when possible.

> Someone with a 
> quadcore machine, running MPI on a flat network, told me 
> that their application scales poorly to 4 processors. 

Which could be because he's limited by memory bandwidth, or network
bandwidth, or message rate. There are a lot of potential reasons.

> In a quadcore machine, if 4 OMP/threads processes are 
> started on each quadcore package, could they be rescheduled 
> at the end of their timeslice, on different cores arriving 
> at a cold cache?

Most MPI and OpenMP implementations lock processes to cores for this
very reason.

> In a single image machine (with 
> a single address space) how does the OS know to malloc 
> memory from the on-board memory, rather than some arbitrary
> location (on another board)?

Generally the default is to always malloc memory local to the process.
Linux grew this feature when it started being used on NUMA machines
like the Altix and the Opteron.

> I expect everyone here knows all this. How is everyone going 
> to program the quadcore machines?

Using MPI?

You can go read up on new approaches like UPC, Co-Array Fortran,
Global Arrays, Titanium, Chapel/X10/Fortress, etc., but MPI is going
to be the market leader for a long time.

-- greg
