[Beowulf] More cores/More processors/More nodes?

Sat Sep 30 13:53:01 PDT 2006

> It seems there are at least 3 dimensions for expansion.  What (in your
> opinion) is the right tradeoff between more cores, more processors and
> more individual compute nodes?

I'd claim this is not a matter of opinion, but rather a matter of which
things matter most to you: memory bandwidth or capacity, density,
interconnect bandwidth, perhaps even disk IO bandwidth.

> In particular, I am thinking of in-house parallel finite difference /
> finite element codes, parallel BLAS, and maybe some commercial
> Monte-Carlo codes (the last being an embarrassingly parallel problem).

monte carlo, from what I see, is both embarrassingly parallel (emb-par) and
tiny in memory footprint, so it really just wants lots of cores, with little
memory and a light interconnect.
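
to make "emb-par and tiny" concrete, here is a minimal sketch (not the poster's commercial code) of a monte carlo estimate of pi split across independent workers. the names count_hits, estimate_pi and estimate_pi_parallel are hypothetical, and seeding each worker with its index is an assumption that is fine for illustration but not a guarantee of statistically independent streams. the point is only that each worker needs a seed and a sample count, and hands back a single integer.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def count_hits(task):
    """Count random points in the unit square that fall inside the quarter circle."""
    seed, n_samples = task
    rng = random.Random(seed)  # private generator: no state shared between workers
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_workers, samples_per_worker, run_map=map):
    """Combine the independent per-worker counts into one estimate of pi."""
    tasks = [(seed, samples_per_worker) for seed in range(n_workers)]
    total_hits = sum(run_map(count_hits, tasks))  # the only "communication" step
    return 4.0 * total_hits / (n_workers * samples_per_worker)


def estimate_pi_parallel(n_workers, samples_per_worker):
    """Same estimate, with one process per worker (one per core, ideally)."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return estimate_pi(n_workers, samples_per_worker, run_map=pool.map)
```

calling estimate_pi(4, 250_000) runs the workers serially; estimate_pi_parallel(4, 250_000) gives the same answer using four processes. either way each worker holds a few scalars and returns one number, which is why core count matters far more than memory or interconnect here.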

but that's an extreme; more generally the right choice depends on issues like 
how cache-friendly the code is (thus less sensitive to the
core-to-memory-bandwidth ratio), whether on-node shared memory is 
a big win (still faster than interconnect, easier to program), whether 
memory _capacity_ is more of an issue (which with AMD leads to more 
sockets/node), etc.

it does seem like finite-element stuff tends to have a relatively high ratio
of work (subdomain volume) to communication (subdomain surface area), so it
is not terribly demanding of interconnect (you can use a cheaper interconnect,
and sharing it among multiple cores per node does less harm).
similarly, higher levels of blas are less demanding of mem-bw.
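
both claims are really ratios of compute to data moved, and a rough sketch makes the scaling visible. the function names below are hypothetical, and the counts are idealized assumptions: a cubic n x n x n subdomain exchanging one layer of cells per face, and blas operations on size-n vectors or n x n matrices with perfect cache reuse (axpy for level 1, matrix-vector for level 2, matrix-matrix for level 3, ignoring alpha/beta scaling).

```python
def halo_to_work_ratio(n):
    """Cells exchanged per cell updated, for an n x n x n subdomain."""
    work = n ** 3       # cells updated per step (volume)
    halo = 6 * n ** 2   # one layer of cells on each of the six faces (surface)
    return halo / work  # simplifies to 6 / n


def blas_flops_per_word(level, n):
    """Idealized flops per word of memory traffic for BLAS level 1, 2 or 3."""
    flops = {1: 2 * n, 2: 2 * n ** 2, 3: 2 * n ** 3}[level]
    # minimum words touched: level 1 reads x, y and writes y; level 2 reads A, x
    # and writes y; level 3 reads A, B and writes C
    words = {1: 3 * n, 2: n ** 2 + 2 * n, 3: 3 * n ** 2}[level]
    return flops / words


if __name__ == "__main__":
    for n in (10, 50, 100):
        print(n, halo_to_work_ratio(n))
    for level in (1, 2, 3):
        print(level, blas_flops_per_word(level, 1000))
```

the halo ratio falls as 6/n, so bigger subdomains per worker mean proportionally less interconnect traffic. likewise, flops per word stays flat for level 1 and 2 but grows with n for level 3, which is why matrix-matrix code is far less sensitive to memory bandwidth per core.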

> I have been set the task of building our first cluster for these
> applications.  Our existing in-house codes run on an SGI machine with a
> parallelizing compiler.  They would need to be ported to use MPI on a
> cluster.

would they?  have you considered whether they'd run well on something 
like an 8-socket, 16-core AMD system?  I'm guessing the SGI is an older
MIPS-based Origin, and thus has dramatically slower CPUs.

by "parallelizing compiler" do you mean OpenMP?

> However, I do not understand what happens when you have
> multi-processor/multi-core nodes in a cluster.  Do you just use MPI
> (with each thread using its own non-shared memory) or is there any way
> to do "mixed-mode" programming which takes advantage of shared memory
> within a node (like, an MPI/OpenMP hybrid?).

sure, all the memory in a node is shared, so you can use threads or other
shared-memory techniques if you want.  but this takes lots of additional
effort.  is it worth it?  bear in mind that most MPI implementations already
take advantage of faster (shared-memory) access to a peer which happens to be
on the same node.  and there are some packages (eg GotoBLAS) which can use
threads internally, and thus give you speedup even if you don't explicitly
program the threads.

I don't see anyone bothering with this on our clusters - people who make the 
jump to MPI tend not to care about small factors like 2 vs 4 cores/node,
since they're aiming at 3-digit core counts.  it's also easier to schedule
an n-way MPI job that has no requirements about the layout of workers,
versus one which would require all the cpus on all of its nodes.

for your transition, I would guess you need a combo-cluster: some nice fat
nodes, as well as a decent-sized set of MPI-friendly ones.  you really need
to investigate your workload to figure out whether you can use gigabit ethernet
everywhere (surprisingly effective, even for serious MPI that's not emb-par)
or whether you need to step up to a real HPC interconnect (to me, that would 
be either InfiniPath or Myrinet-10G.)

regards, mark hahn.