[Beowulf] Benchmark between Dell Poweredge 1950 And 1435

Thu Mar 8 14:40:37 PST 2007

As Robert Brown (and others) so eloquently said, nothing is better than your
actual application with your actual input files in an actual production run.

Results vary widely, and any kind of general statement could easily be proven
significantly wrong in your specific case.

Additional things to keep in mind:
* Compilers can make a huge difference.  Intel's compiler, for instance,
   used to penalize AMD chips on the order of 5-15%.  This was proven by
   removing the if (running_on_amd()) check and seeing the improved
   performance.  Other compilers will deliver different results because
   they achieve a different percentage of peak performance.  Pathscale
   in particular seems to sometimes achieve great performance and at other
   times just average performance; it is highly code dependent.
* Intel and AMD have dramatically different cache and memory architectures.
   Make sure your runs are as close to real world usage as possible.  In
   particular, a single thread on a dual socket dual core node can behave
   dramatically differently from 4 threads running on the same node.
* Performance of a single application can change radically based on
   the dataset.  Intel for instance might win on your application with
   a "benchmark" dataset that runs quickly, but do worse on
   a real dataset that is more memory intensive.  Then again some production
   codes/datasets will run dramatically better on the Intel chips.
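
To make the dataset-size point concrete, here is a minimal sketch (in Python,
which nothing in this thread actually used) that copies buffers of increasing
size and reports the rate for each.  The function name copy_rate and the size
range are illustrative assumptions, and interpreter overhead means the numbers
are only a rough indication of where a working set stops fitting in cache.

```python
import time

def copy_rate(size_bytes, min_seconds=0.2):
    """Copy a size_bytes buffer repeatedly; return bytes moved per second."""
    # Nonzero fill so the source pages are really written, not lazily mapped.
    src = bytearray(b"\x01") * size_bytes
    dst = bytearray(size_bytes)
    dst[:] = src                      # warm-up pass touches every page of dst
    reps = 0
    start = time.perf_counter()
    while True:
        dst[:] = src                  # same-length slice copy (a memory copy)
        reps += 1
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            break
    # Count both the bytes read from src and the bytes written to dst.
    return 2 * size_bytes * reps / elapsed

if __name__ == "__main__":
    size = 256 * 1024                 # assumed starting point: 256 KiB
    while size <= 64 * 1024 * 1024:   # assumed end point: 64 MiB
        print(f"{size // 1024:8d} KiB  {copy_rate(size) / 1e9:6.2f} GB/s")
        size *= 2
```

If the reported rate drops sharply somewhere between the small and large
buffers, that is roughly where a quick "benchmark" dataset and a memory
intensive real dataset would start to behave differently.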

In general Intel wins many floating point single thread codes.  Their 4MB
of L2 (vs 1MB on AMD) and 7-9GB/sec memory system can keep up with the
demands of a single thread well enough to leverage the generally higher
floating point performance.  SPECfp2000 isn't a terrible way to measure
this (again, not nearly as nice as running your own application).

In the 4 thread case several factors cause the Intel chip to scale poorly.
The L2 cache is shared, so you get 2MB per core (instead of 4), AND the
cache can't meet the needs of 2 cores hitting L2 flat out.  Then as
you fall out of (the smaller) cache the memory system doesn't scale.  I've
yet to identify why, but the advertised "dual frontside bus" seems to
improve bandwidth by about 0% compared to the rather poor throughput of
the last generation NetBurst shared FSB.  So despite a significant gain
in cores (double) and work done per cycle (about double), the current
generation Intel chips have no more memory bandwidth than the previous
generation.  I played with various BIOS settings (cache snooping and
related) with zero improvement in the observed numbers.

If Intel has somehow fixed this, please post to the list.  Despite
having 2 128 bit memory interfaces and 2 frontside busses, I've yet
to see a case where the bandwidth improves (let alone doubles).
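
One way to check the bandwidth scaling claim on your own hardware is to run
several memory-copy workers at once and add up what each one reports.  The
sketch below is an assumption-laden stand-in, not the test used for the
numbers above: the names WORKER and aggregate_bandwidth, the 64 MiB buffer and
the repetition count are all made up for illustration, and the workers are not
strictly synchronized, so treat the result as a rough indication only.

```python
import subprocess
import sys

# Worker script: copy a buffer much larger than cache, print bytes per second.
WORKER = """
import sys, time
size = int(sys.argv[1]); reps = int(sys.argv[2])
src = bytearray(b"\\x01") * size      # nonzero fill so pages are really written
dst = bytearray(size)
dst[:] = src                          # warm-up pass touches every page
t0 = time.perf_counter()
for _ in range(reps):
    dst[:] = src                      # one read stream plus one write stream
elapsed = time.perf_counter() - t0
print(2 * size * reps / elapsed)
"""

def aggregate_bandwidth(n_procs, size=64 * 1024 * 1024, reps=10):
    """Run n_procs copy workers at once; return the summed bytes/second."""
    cmd = [sys.executable, "-c", WORKER, str(size), str(reps)]
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
             for _ in range(n_procs)]
    # All workers are already running; collect each one's reported rate.
    return sum(float(p.communicate()[0]) for p in procs)

if __name__ == "__main__":
    base = aggregate_bandwidth(1)
    for n in (1, 2, 4):
        bw = aggregate_bandwidth(n)
        print(f"{n} copies: {bw / 1e9:6.2f} GB/s  ({bw / base:4.2f}x of 1 copy)")
```

If the memory system scales, the aggregate figure should climb as copies are
added; a ratio that stays near 1.00x is the flat behaviour described above.  A
compiled STREAM-style benchmark would give far more trustworthy numbers than
this interpreter-driven sketch.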

If you look at the Spec2000 FP Rate benchmarks you'll see that despite
a substantial lead in single thread performance, the system performance
is just about dead even with the Opteron.  Spec2000 isn't exactly
a current benchmark and was intended for systems with relatively
little RAM (256 or 512MB if memory serves).  Any number of real world
applications could be significantly more memory intensive than the old
spec.

So all the above is just so much handwaving.  Any of dozens of factors
could double or halve performance on your application, so get out a
stopwatch and run it.  I suspect any number of vendors or even fellow beowulf
list folks would either run your application code or allow you to run it.
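
As a software stopwatch, a minimal sketch along these lines times 1, 2 and 4
simultaneous copies of a command.  The helper name time_concurrent and the
placeholder workload are assumptions for illustration; substitute your real
application and input files for the placeholder.

```python
import subprocess
import sys
import time

def time_concurrent(argv, copies):
    """Start `copies` instances of argv together; return wall-clock seconds."""
    start = time.perf_counter()
    procs = [subprocess.Popen(argv) for _ in range(copies)]
    codes = [p.wait() for p in procs]         # wait for every copy to finish
    elapsed = time.perf_counter() - start
    if any(codes):
        raise RuntimeError(f"exit statuses {codes} from {argv!r}")
    return elapsed

if __name__ == "__main__":
    # Placeholder workload: replace with your real application and input.
    argv = [sys.executable, "-c", "sum(range(10**7))"]
    t1 = time_concurrent(argv, 1)
    for copies in (1, 2, 4):
        t = time_concurrent(argv, copies)
        # Jobs finished per unit time, relative to the single-copy run.
        print(f"{copies} copies: {t:6.2f} s, {copies * t1 / t:4.2f}x throughput")
```

On a dual socket dual core node, a 4 copy throughput close to 4.00x means the
application scales across the cores; a figure well below that suggests the
copies are fighting over cache or memory bandwidth.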

For a wide mix of applications in the past I've leaned towards AMD because
my real world testing showed AMD usually won.  The gap has closed 
significantly in the last year (it used to be so embarrassing).  Today
I'd call it mostly a wash.  Things are shaping up to be pretty interesting:
AMD has the opportunity to take a commanding lead with their next generation
chip, which rumors claim will be shipping this summer.

The bad news is that while AMD's next generation promises dramatically better
work done per cycle, the memory system doesn't look like it's going to
get much (if any) more memory bandwidth.