[Beowulf] Selection from processor choices; Requesting Guidance

Thu Jun 15 08:39:08 PDT 2006

> >   1. One processor at each of the compute nodes
> >   2. Two processors (on one mother board) at each of the compute nodes
> >   3. Two Processors (each one dual-core processor) (total 4 cores on 
> >   4. Four processors (on one motherboard) at each of the compute nodes.

not considering a 4x2 configuration?

> > Initially, we are deciding to use a Gigabit ethernet switch and 1GB of 
> >RAM at
> >each node.

that seems like an odd choice.  it's not much ram, and gigabit is 
extremely slow (relative to alternatives, or in comparison to on-board
memory access.)

> I've heard many times that memory throughput is extremely important 
> in CFD and that using 1 cpu/1 core per node (or 2 single-core 
> Opterons having independent memory channels) is in some cases better
> than any sharing of memory bus(es).

I've heard that too - it's a shame someone doesn't simply use the 
profiling registers to look at cache hit-rates on these codes...

but I'd be somewhat surprised if modern CFD codes were entirely 
mem-bandwidth-dominated, that is, that they wouldn't make some use 
of the cache.  my very general observation is that it's getting to be 
unusual to encounter code which has as "flat" a memory reference 
pattern as Stream - just iterating over whole swaths of memory 
sequentially.  advances such as mesh adaptation, etc. tend to make 
memory references less sequential (more random, but also touching 
fewer overall bytes, and thus possibly more cache-friendly.)
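
to make the "more random, but fewer bytes" point concrete, here is a toy 
sketch, not a measurement of any real CFD code.  it pushes two made-up 
reference traces through a simple LRU cache model and compares hit rates.  
the cache size (1MB, 64-byte lines, fully associative LRU) and both traces 
are assumptions chosen purely for illustration.

```python
from collections import OrderedDict
import random


def lru_hit_rate(trace, capacity_lines):
    """Fraction of accesses in `trace` (cache-line numbers) that hit
    a fully associative LRU cache holding `capacity_lines` lines."""
    cache = OrderedDict()
    hits = 0
    for line in trace:
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


LINE_BYTES = 64                        # assumed cache-line size
CACHE_LINES = (1 << 20) // LINE_BYTES  # assumed 1MB cache -> 16384 lines

# Stream-like trace: two sequential sweeps over an array 4x the cache size.
stream_trace = list(range(4 * CACHE_LINES)) * 2

# hypothetical "adaptive" trace: the same number of references, in random
# order, but confined to a working set half the size of the cache.
rng = random.Random(0)
working_set = CACHE_LINES // 2
random_trace = [rng.randrange(working_set) for _ in range(len(stream_trace))]

stream_rate = lru_hit_rate(stream_trace, CACHE_LINES)
random_rate = lru_hit_rate(random_trace, CACHE_LINES)

print(f"sequential sweep hit rate: {stream_rate:.3f}")
print(f"random, small working set: {random_rate:.3f}")
```

the model counts whole cache lines, so it ignores the within-line reuse a 
real sequential sweep gets, as well as hardware prefetch.  it only shows that 
reuse depends on working-set size rather than on how orderly the access 
pattern looks.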

of course, I'm just an armchair CFD'er ;)

in short, it's important not to disregard memory bandwidth, but 
6.4 GB/s is quite a bit, and may not be a problem on a dual-core system
where each core has 1MB L2 to itself.  especially since 1GB/system
implies that the models are not huge in the first place.
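
for scale, a back-of-envelope sketch of those numbers.  the decimal units, 
the assumption that both cores share the full 6.4 GB/s, and the use of 
gigabit's raw peak rate (no protocol overhead) are mine, so treat the 
results as rough orders of magnitude only.

```python
# back-of-envelope figures; decimal units (1 GB = 1e9 bytes) are an assumption
MEM_BW = 6.4e9      # bytes/s, the memory bandwidth figure quoted above
CORES = 2           # dual-core, assumed to share that bandwidth
RAM = 1e9           # bytes, 1GB per node
GIGE_BW = 1e9 / 8   # bytes/s, gigabit ethernet raw peak, no protocol overhead

per_core_bw = MEM_BW / CORES     # share per core if both stream at once
sweep_time = RAM / MEM_BW        # time for one full pass over all of RAM
mem_vs_gige = MEM_BW / GIGE_BW   # how much faster local memory is than the wire

print(f"per-core share with both cores streaming: {per_core_bw / 1e9:.1f} GB/s")
print(f"one full pass over 1GB of RAM: {sweep_time:.3f} s")
print(f"memory vs gigabit peak: {mem_vs_gige:.1f}x")
```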

that said, I find that CFDers tend not to aspire to running on large 
numbers of processors.  so a cluster of 4x2 machines (which aim to 
run mostly <= 8p jobs on single nodes) might be very nice.  there are 
nice side-effects to having fatter nodes, especially if your workload
is not embarrassingly parallel.

(we should have terminology to describe other levels of parallel coupling -
"mortifyingly parallel", for instance.  I think "shamefully parallel" is a
great description of people who gratuitously wrap a serial job in an MPI
wrapper.  and how about "immodestly parallel" for coupled
jobs that scale well, but still somewhat sub-linearly?)