[Beowulf] programming multicore clusters

Fri Jun 15 06:00:24 PDT 2007

For the foreseeable future, I'm not developing much, but I will use the 
hybrid SMP/DM capabilities in WRF.  That mode takes advantage of SMP 
availability and supports message passing between SMP nodes.  I've not 
used this capability for benchmarking, but it appears to offer significant gains.
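
A rough way to see where such gains might come from is to count how many 
halo exchanges have to leave a node.  The sketch below is a toy model, not 
WRF's actual decomposition: it assumes a 1-D periodic domain in which each 
rank talks only to its two neighbours, and block placement (consecutive 
ranks share a node).  The function name halo_messages is made up for 
illustration.

```python
def halo_messages(n_ranks, cores_per_node):
    """Count directed nearest-neighbour halo messages per exchange step.

    Toy model (assumption): 1-D periodic decomposition, rank r exchanges
    with r-1 and r+1, and ranks are placed on nodes in consecutive blocks
    of `cores_per_node`.
    Returns (total messages, messages that cross a node boundary).
    """
    total = 0
    off_node = 0
    for rank in range(n_ranks):
        for neighbour in ((rank - 1) % n_ranks, (rank + 1) % n_ranks):
            total += 1
            # Block placement: the node index is the rank divided by cores per node.
            if rank // cores_per_node != neighbour // cores_per_node:
                off_node += 1
    return total, off_node


# 16 ranks on 4 quad-core nodes.
total, off_node = halo_messages(16, 4)
print(total, off_node, total / off_node)
```

In a flat model every one of the `total` messages uses the off-board 
network; if on-node neighbours communicate through shared memory, only 
the `off_node` ones do.  With 2-D or 3-D decompositions each rank has more 
neighbours, so the ratio would generally differ from the 1-D case.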

As we get more hybrid HPC capabilities, planning for this will become more 
important.  A lot of system administrators (based on a statistical 
sample of 4 local ones) have decreed that the hybrid approach is 
inefficient and that one should run either isolated shared memory or 
distributed memory, so that we don't make our Gaussian users feel 
unloved.  I'm skeptical.

gerry

Joseph Mack NA3T wrote:
> I've googled the internet and searched the Beowulf archives
> for "hybrid" || "multicore" and the only definitive statement I've found 
> is by Greg Lindahl, 17 Dec 2004
> 
> "Most of the folks interested in hybrid models a few years ago have now 
> given it up".
> 
> I assume this was from the era of 2-way SMP nodes.
> 
> Multicore CPUs are being projected for 15yrs into the future (statement 
> by Pat Gelsinger, Intel's CTO, quoted in
> http://cook.rfe.org/grid.pdf)
> 
> I expect the programming model will be a little different
> for single image machines like the Altix, than for beowulfs
> where each node has its own kernel (and which I assume will
> be running dual quadcore mobos).
> 
> Still, if a flat, one-network model is used, all processes communicate 
> through the off-board networking. Someone with a quadcore machine, 
> running MPI on a flat network, told me that their application scales 
> poorly to 4 processors. If instead processes on cores within a package 
> were working on adjacent parts of the compute volume and communicated 
> through the on-board networking, then for a quadcore machine, the 
> off-board networking bandwidth requirement would drop by a factor of 4 
> and scaling would improve.
> 
> In a quadcore machine, if 4 OMP/threads processes are started on each 
> quadcore package, could they be rescheduled at the end of their 
> timeslice onto different cores, arriving at a cold cache? On a large 
> single image machine, could a thread be scheduled on another node and 
> have to communicate over the off-board network? In a single image 
> machine (with a single address space), how does the OS know to malloc 
> memory from the on-board memory, rather than some arbitrary location (on 
> another board)?
> 
> I expect everyone here knows all this. How is everyone going to program 
> the quadcore machines?
> 
> Thanks Joe
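
On the rescheduling question above: on Linux the scheduler is free to 
migrate a process between cores unless its CPU affinity is restricted.  
A minimal sketch of pinning, assuming a Linux host and Python 3.3 or 
later (os.sched_setaffinity is not available on other platforms); the 
helper name pin_to_one_cpu is hypothetical.

```python
import os


def pin_to_one_cpu():
    """Pin the caller to a single CPU taken from its currently allowed set.

    Returns the chosen CPU id, or None where the platform does not
    expose sched_setaffinity (it is Linux-specific).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = sorted(os.sched_getaffinity(0))  # CPUs we may run on right now
    cpu = allowed[0]                           # arbitrary choice for this sketch
    os.sched_setaffinity(0, {cpu})             # pid 0 refers to the caller
    return cpu


cpu = pin_to_one_cpu()
if cpu is not None:
    print("pinned to CPU", cpu, "affinity is now", os.sched_getaffinity(0))
```

Once pinned, the process stays on that core, so it is not moved away 
from whatever it has already pulled into that core's cache.  A real 
hybrid job would more likely set affinity through its MPI launcher or 
OpenMP runtime than by hand like this.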

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843