DUAL CPU board vs 2 Single CPU boards: bang for buck?

Thu Mar 7 07:39:43 PST 2002

On Thu, 7 Mar 2002, Jim Fraser wrote:

> I don't desire to start a flame but it seems to me that for computational
> intensive and memory intensive work that 2 singles are better and in most
> cases cheaper than a dual setup.  The dual SMP systems out there now have to
> fight for bandwidth along the same bus.  The AMD chips while fast appear to
> be starved for data when *big* memory jobs are running.  Further I don't see
> the cost benefits, if you actually dig into the "bang-for-buck" duals never
> seem to win.

That's fine with me;-) Don't consider the following a flame, then, but a
respectful disagreement and openhearted list discussion...:-)

Let's do the arithmetic and compare.  A dual typically shares a case, a
hard disk and a NIC (and possibly a video card depending on whether your
operation chooses to use serial consoles).  Yes, there are
configurations that might leave out a hard disk, but let's just assume
that we need/want one (in our case we do, and in any event it makes a
node a teeny bit easier to engineer and install IMHO, and makes the
nodes a teeny bit faster at less cost to the network and a centralized
server).

A case costs between $60 (mid-tower) and $350 (rackmount).  Let's for
the sake of argument assume rackmount 2U cases (we currently require
rackmount cases to be able to pack LOTS of nodes into a limited and
expensive space) and arbitrarily assign a cost of $250.  A "reasonable"
node hard disk is $100.  A NIC might be included in the motherboard or
purchased separately.  We'll try to accommodate it either way, assigning a
cost of $50 (typical for an OTC 3c905, for example, although we might
find cheaper ones).  Using GbE instead of (or in addition to) 100BT
would add about $125 for a high-end controller, or as little as $30-50
for a low end controller; Myrinet would add a LOT, but we'll ignore both
of these possibilities as well as the possible need for a video
controller.  Adding up what we pretty much MUST have gives us an
"overhead" cost of $400 for a 2U rackmount system regardless of what we
put in it.  Just to avoid argument over the best possible price vs a
reasonable OTC price, I'll arbitrarily deduct $50 from this and make the
unit overhead $350 for case, disk, and marginal cost of a single NIC.
Any additional hardware requirements per box (as opposed to per
processor) obviously get added to this.

We can choose to put a UP motherboard, processor and memory in it.  A
"typical" configuration might be a $100 motherboard without onboard NIC
or a $150 motherboard with onboard NIC -- but why quibble over $20 or
so relative to our $50 estimate for NIC cost above?  Let's choose $120,
since the cheaper UP motherboards can be a bit cheesy.  A high end
Athlon is in the $250-300 range, let's say $280.  Memory prices have
been rapidly varying recently (in the good direction).  It won't really
matter what we use since we'll spend the same amount in either packaging
for a given amount per CPU -- let's choose one 512 MB DDR ECC DIMM at
around $200.  Our UP system thus costs

  $350   overhead (case, disk, NIC)
  $120   UP motherboard
  $280   Athlon CPU
  $200   512 MB DDR ECC DIMM
======
  $950

and two of them cost $1900.

Alternatively, we could use a dual motherboard, e.g. Tiger 2466, which
costs about $220 (and definitely has a built in NIC).  Our system then
might cost

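  $350   overhead (case, disk, NIC)
  $220   dual motherboard
  $560   two Athlon CPUs (2 x $280)
  $400   two 512 MB DDR ECC DIMMs (2 x $200)
======
 $1530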

We see that we save well over $150 per processor by getting duals
instead of singles.  In reality, the savings are even greater -- the
extra case, power supply, motherboard and disk consume an extra 50 Watts
or so (including the power needed for AC), and (at $0.06/kWh over a year) this
will add an extra $20+/year to your operational costs per pair.  The
additional maintenance costs associated with the extra case, power
supply and fan will add ANOTHER $20 per pair.  A fairer estimate of
marginal operational cost over a three year lifetime would be perhaps
$250, remembering that we were excessively generous with the cost per
case in the first place.  Still, let's stick with $150 out of $950 or
roughly 15% cost differential in favor of dual packaging.
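For anyone who wants to rerun the arithmetic above with their own quotes,
here is a minimal Python sketch.  The prices are just the round numbers
used in this message; the function and variable names are made up for
illustration, and the 50 W and $0.06/kWh figures are the rough estimates
from the paragraph above, not measurements.

```python
# Round-number prices taken from this message (March 2002); substitute your own.
OVERHEAD = 350       # case, disk, marginal cost of one NIC, per box
UP_BOARD = 120       # single-CPU motherboard
DUAL_BOARD = 220     # dual motherboard, e.g. Tiger 2466
CPU = 280            # one high-end Athlon
MEM_PER_CPU = 200    # one 512 MB DDR ECC DIMM per CPU


def box_cost(board, cpus):
    """Purchase cost of one box: shared overhead + board + per-CPU parts."""
    return OVERHEAD + board + cpus * (CPU + MEM_PER_CPU)


def yearly_power_cost(watts, dollars_per_kwh=0.06):
    """Cost of drawing `watts` continuously for one year."""
    return watts / 1000 * 24 * 365 * dollars_per_kwh


two_singles = 2 * box_cost(UP_BOARD, 1)
one_dual = box_cost(DUAL_BOARD, 2)
saving_per_cpu = (two_singles - one_dual) / 2

print(f"two singles: ${two_singles}, one dual: ${one_dual}")
print(f"purchase saving per CPU with duals: ${saving_per_cpu:.0f}")
print(f"extra power for a second box (50 W): ${yearly_power_cost(50):.2f}/year")
```

Changing OVERHEAD to a $60 mid-tower case plus disk and NIC shows how
quickly the dual advantage shrinks when the shared parts are cheap.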

Now, is this worth it?  No use begging the question (as you attempt to
do above, no offense or flame intended;-) by starting with
"computationally intensive and memory intensive" since these are
QUANTITATIVE concepts, not qualitative, and will be distributed
differently, in detail, for each distinct application.  Your mileage
here may vary with a vengeance -- the only way to proceed is to analyze
the specific HPC tasks to be run on the hardware in question.  Remember,
somebody might TAKE your advice and engineer a cluster based on singles
when it really isn't the best thing for them to do.

For CPU bound tasks that tend to remain fairly local in cache (so they
go to memory relatively infrequently and in a bursty way) two threads on
a dual will often complete in just about exactly the same amount of time
as two threads on two singles.  They certainly do for my Monte Carlo
computations, and have on every CPU/memory packaging back to plain old
EDO RAM on Pentium Pros six or seven years ago.  A dual packaging
produces NO time of completion penalty and saves you $150/cpu.  Every
six processors you buy in a dual packaging gets you roughly one
processor for FREE relative to a single packaging, and you finish all
your work in roughly 85% of the single packaging time for a given fixed
budget.
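The "one free processor in six" and "85%" figures follow directly from
the per-CPU costs.  A small sketch of that step, assuming (as in the
CPU-bound case above) that the work divides evenly over CPUs and that the
dual carries no speed penalty; the function name is hypothetical and the
$950 and $800 inputs are the single price and the single price less the
conservative $150 saving:

```python
def fixed_budget_comparison(cost_per_cpu_single, cost_per_cpu_dual):
    """Compare dual vs single packaging for the same total budget.

    Assumes perfectly divisible work and no per-CPU slowdown on the dual.
    Returns (cpu_ratio, relative_time, cpus_per_free_cpu).
    """
    # The same budget buys this many times more CPUs in dual packaging.
    cpu_ratio = cost_per_cpu_single / cost_per_cpu_dual
    # With more CPUs and no penalty, completion time shrinks by the inverse.
    relative_time = 1.0 / cpu_ratio
    # Out of this many dual-packaged CPUs, one is paid for by the savings.
    saving = cost_per_cpu_single - cost_per_cpu_dual
    cpus_per_free_cpu = cost_per_cpu_single / saving
    return cpu_ratio, relative_time, cpus_per_free_cpu


ratio, rel_time, per_free = fixed_budget_comparison(950, 800)
print(f"CPUs for the same budget: {ratio:.2f}x")
print(f"time to completion: {rel_time:.0%} of the single-CPU cluster")
print(f"one CPU in every {per_free:.1f} is effectively free")
```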

In this case, buying singles is obviously not the optimal choice.  This
is not a rare occurrence by any stretch of the imagination.  It may even
be the most common situation for "most" cluster users (I personally
think that it is, but my evidence is local and anecdotal).

OTOH, if one is running a significantly MEMORY bound task, one has to be
very careful.  If it really is a stream-like task (multiplying lots of
big matrices and vectors, lots of memory intensive linear algebra) then
two processors may well collide on the memory bus and reduce the
effective throughput of the dual relative to two singles.  This is a
>>difficult<< thing to categorically pronounce upon, however, as the
degree of degradation (if any at all) depends in detail FOR EACH
APPLICATION (and possibly even on different parameters within ONE
application) on things like the stride, just how much computation occurs
after a given cache-filling memory access, and how well the scheduler
antibunches these memory accesses in the running code so that they
collide minimally or not at all.  Some jobs that you think might be
memory bound organize themselves in operation so that cpu 0 is accessing
memory while cpu 1 is computing and vice versa so even though they COULD
collide and degrade, they tend not to.  Or they appear to collide when
run in small prototyping runs but in larger runs they don't, or vice
versa.  Others are sufficiently random or evilly-patterned that even
though in principle there is enough total bandwidth, in practice one CPU
often has to wait on the other.  The only safe thing to do is to MEASURE
your particular job's performance on the two alternatives (preferably
at all interesting operational scales!) and see how single-thread/single
CPU completion times compare to double thread/double CPU completion
times.
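One way to take that measurement on a candidate dual is to time one copy
of the job against two copies started together.  The following is only a
sketch: the helper names are invented, the workload in the example is a
stand-in that you would replace with your real job at a realistic
problem size, and it measures wall-clock time only.

```python
import subprocess
import sys
import time


def time_concurrent(cmd, copies):
    """Wall-clock seconds until `copies` simultaneous runs of cmd all finish."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for _ in range(copies)]
    for p in procs:
        if p.wait() != 0:
            raise RuntimeError(f"{cmd!r} exited with status {p.returncode}")
    return time.perf_counter() - start


def dual_slowdown(cmd, repeats=3):
    """Best two-copy time divided by best one-copy time.

    About 1.0 on a dual means the two CPUs are not getting in each
    other's way; values well above 1.0 suggest contention.
    """
    t1 = min(time_concurrent(cmd, 1) for _ in range(repeats))
    t2 = min(time_concurrent(cmd, 2) for _ in range(repeats))
    return t2 / t1


if __name__ == "__main__":
    # Placeholder workload; substitute the actual application command line.
    job = [sys.executable, "-c", "sum(i * i for i in range(2_000_000))"]
    print(f"slowdown with two copies: {dual_slowdown(job):.2f}x")
```

Run it at several problem sizes, since a job that fits in cache in a
small test may behave quite differently at production scale.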

Still, it is entirely possible that performance will degrade more than
the 15% cost differential.  In that case one should very likely select
single CPU packaging.
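That comparison can be written down as a price/performance test.  In the
sketch below, "slowdown" is a measured ratio (time for two jobs running
together on a dual divided by time for one job on a single), the default
costs are the $950 and $800 per-CPU figures from this message, and the
function names are hypothetical.  It ignores IPC and operating costs.

```python
def break_even_slowdown(cost_per_cpu_single=950.0, cost_per_cpu_dual=800.0):
    """Largest slowdown at which a dual still matches singles per dollar."""
    return cost_per_cpu_single / cost_per_cpu_dual


def dual_is_better(slowdown, cost_per_cpu_single=950.0, cost_per_cpu_dual=800.0):
    """True if a dual delivers more work per dollar than two singles.

    slowdown: two-job time on the dual / one-job time on a single
              (1.0 means no penalty from sharing the memory bus).
    """
    work_per_dollar_dual = 1.0 / (slowdown * cost_per_cpu_dual)
    work_per_dollar_single = 1.0 / cost_per_cpu_single
    return work_per_dollar_dual > work_per_dollar_single


print(f"break-even slowdown: {break_even_slowdown():.2f}x")
for s in (1.0, 1.1, 1.25):
    choice = "dual" if dual_is_better(s) else "singles"
    print(f"slowdown {s:.2f}x -> buy {choice}")
```

With these prices the break-even point is a run time about 19% longer,
which is the same thing as losing roughly 15% of per-CPU throughput.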

In the above we've neglected the other most important component of
parallel system design -- IPC costs.  A dual packaging generally
requires two processors to share a network IPC channel.  It may be that
duals and buses are now fast enough that a dual can run two 100BT
channels as fast as two singles with a 100BT NIC apiece; I haven't
measured recently.  Last time I >>did<< measure, there was maybe a 10%
degradation of peak speed on good quality hardware, more on cheapie
stuff.  A possible degradation in peak IPC rates per processor between
nodes has to be balanced a bit with memory-based IPCs between CPUs on a
single node, which may be a significant advantage for certain tasks.  It
is thus VERY hard to guesstimate whether a dual packaging is good or bad
for your real, parallel HPC application with nontrivial internode IPCs
and again the safest thing to do is prototype both ways and measure, if
at all possible.

The second safest thing to do is ask the list to see if somebody is
running your particular favorite application or one very much like it.
Sometimes you'll get the measurements you need from somebody who did the
prototyping for you.  Sometimes you'll get offers from nice people to
let you run a benchmark of your application on their hardware.  Either
one can save you from an expensive mistake even if you can't find a
friendly vendor to loan you hardware, or the money to build a system
each way, and test before buying 64 CPUs (possibly the "wrong way").

To conclude, the cost-benefit analysis outlined above is an ESSENTIAL
part of successful cluster engineering.  To do the best possible job,
one has to reject the notion that "duals are good" or "duals are bad"
and instead analyze the actual price/performance of duals for your
particular task mix in comparison to two singles.  Dual packaging is
without question cheaper, per CPU, than two singles -- the total cost of
CPU, memory and motherboard generally works out to be comparable, but it
saves you the cost of everything (else) that can be shared.  Whether the
cost savings translate into a reduced time of completion of your task
for a given fixed dollar budget is a question that can be answered
systematically by means of measurements or (very:-) well-informed
estimates.

The exact same questions need to be raised repeatedly even within single
or dual CPU designs -- does your application NEED DDR or RDRAM to
perform well?  Many applications are CPU bound and run just fine on
SDRAM or even EDO.  Do you NEED a 64/66 PCI bus or will a boring old
32/33 PCI bus suffice?  Do you NEED a local disk or can you run
diskless?  Do you NEED switched GbE or would 10BT thickwire with good
old vampire taps (or even sneakernet:-) let you finish in the same
amount of time?

A well engineered beowulf or cluster is one that accomplishes the
task(s) for which it was designed in the least amount of time, for the
least possible cost (looking at TOTAL cost, including operational costs
and administrative costs, which are yet ANOTHER thing often ignored:-).
This may be a "standard recipe" general purpose cluster -- a pile of
cheap 256 MB nodes with switched 100BT -- or it may be fancy-schmancy
nodes with big-cache Xeons, superfast disks, the fastest possible PCI
bus and Myrinet controllers, the largest and fastest memory
configuration possible.  It may be as many nodes as one can afford or it
may even be NO MORE than 32 nodes (because more than that simply won't
scale for your problem) with the rest of your budget spent on upgrading
your network to the fastest possible, since this is what limits the
number of nodes you can use.

For what it is worth, our tasks here at Duke are arguably "HPC" tasks,
although some might disagree.  However one feels about the epistemology
of the word, I personally burn GFLOP-years of computing routinely as do
several others in the department using beowulfish compute clusters --
but since our tasks tend to be CPU bound, dual packaging makes the best sense
FOR US.  Not necessarily for you or anybody else.  We do get a handful
of single CPU nodes to mix in with our duals in case somebody needs to
run some specific NEW task that might turn out to be memory bound, but
if such a task moves beyond prototyping and into production we expect
the owner of the task to pony up some bucks for new nodes engineered to
the task.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu