[Beowulf] AMD performance (was 500GB systems)
Bill Broadley
bill at cse.ucdavis.edu
Fri Jan 11 16:22:37 PST 2013
On 01/11/2013 04:01 AM, Joshua mora acosta wrote:
> Hi Bill,
> AMD should pay you for these wise comments ;)
>
> But since this list is about providing feedback and sharing knowledge, I
> would like to add something to your comments, somewhat HW agnostic. Running
> the stream benchmark is an easy way to find out what the memory controllers
> are capable of.
Well, it's my own code; last I checked, stream didn't do dynamic
allocation or use pthreads, not to mention various tweaks for NUMA,
affinity, and related things.
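To give an idea of the shape of it (a stripped-down sketch, not the actual
benchmark; the thread count, array size, and triad constant here are
arbitrary): one pthread pinned per core, each thread mallocs and
first-touches its own arrays so the pages land on its local NUMA node, then
runs a triad-style loop and reports its own bandwidth. Compile with
gcc -O3 -pthread.

#define _GNU_SOURCE          /* for pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        (1L << 24)  /* doubles per array per thread (~128MB each) */
#define NTHREADS 4           /* set to the number of cores */

static void *worker(void *arg)
{
    int core = (int)(long)arg;

    /* pin this thread to one core */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* dynamic allocation, like most real applications */
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);

    /* first touch from the pinned thread so the pages end up on this
       core's local NUMA node */
    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long j = 0; j < N; j++)          /* triad-style kernel */
        c[j] = a[j] + 3.0 * b[j];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    /* 2 reads + 1 write per element; printing c[0] keeps the compiler
       from throwing the loop away */
    printf("core %d: %.0f MB/s (c[0]=%g)\n",
           core, 3.0 * N * sizeof(double) / secs / 1e6, c[0]);

    free(a); free(b); free(c);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

A barrier before the timed loop and a few repetitions per kernel would make
the numbers steadier; the above is just the general shape.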
> Stream does minimal computation, at most the triad, but it really exposes the
> bottleneck (in negative terms) or the throughput (in positive terms) of the
> processor and platform (when accounting for multiple processors connected by
> some type of fabric: cHT, QPI, network) when looking at the aggregated memory
> bandwidth.
Correct, stream is a lousy benchmark for quantifying application
performance. I just wanted to counter some comments I've heard about AMD's
memory system.
> The main comment I would like to add is with respect to your stream bandwidth
> results. Looking at your log2 chart, it says that AMD delivers about ~100GB/s
> on a 4P system and Intel delivers ~30GB/s on 2P systems. I may be reading the
> chart wrong, but it should be about 140GB/s with AMD (Interlagos/Abudhabi)
> with 1600MHz DDR3 memory, about 40GB/s with INTEL (Nehalem/Westmere) with
> memory at 1333MHz DDR3, and about 75GB/s with Sandybridge with memory at
> 1600MHz DDR3.
Well, in my experience there are three major numbers for sequential memory
bandwidth:
1) The marketing numbers (clock speed * width), which come to roughly
   50GB/sec per socket for Intel/AMD with 4 channels (channel math just
   after this list).
2) Stream numbers from good compilers (Intel, Portland Group, or Open64)
   with static arrays: often 50-75% or so of the marketing numbers.
3) Stream numbers from the same compilers with dynamic allocation (malloc
   in C or new in C++): often 25-50% of the marketing numbers. From what
   I can tell the use of dynamic allocation disables non-temporal stores.
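(For the curious, the marketing figure is just channel math; assuming 4
channels of DDR3-1600: 1600 MT/s * 8 bytes per transfer * 4 channels =
51.2 GB/sec per socket.)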
GCC usually matches the dynamic allocation numbers (#3), with or without
dynamic allocation.
I wonder what percentage of bandwidth-intensive codes dynamically
allocate memory.
> In order to do so, you want to use non-temporal stores, which bypass the
> regular process of cache coherence. Many applications behave that way since
> you have to do a pass through the data and you may access it again (eg. in the
I believe the Intel, Portland Group, and Open64 compilers automatically do
this, even when just doing the obvious:
for (j=0; j<N; j++)    // where N = a large array size
    c[j] = a[j] + b[j];
Sadly, if a, b, or c are dynamically allocated, that seems to disable the
non-temporal stores.
For instance, Open64, OpenMP, 1831MB array:
Function       Rate (MB/s)   Avg time   Min time   Max time
Copy:          101336.5507     0.0135     0.0126     0.0146
Scale:          98265.0155     0.0141     0.0130     0.0153
Add:           103543.0881     0.0202     0.0185     0.0225
Triad:         104677.6852     0.0194     0.0183     0.0213
If I switch to using malloc:
97,99c97
< static double a[N+OFFSET],
< b[N+OFFSET],
< c[N+OFFSET];
---
> static double *a,*b,*c;
134a133,135
> a = (double *) malloc ((N+OFFSET)*sizeof(double));
> b = (double *) malloc ((N+OFFSET)*sizeof(double));
> c = (double *) malloc ((N+OFFSET)*sizeof(double));
Copy:           74228.2843     0.0178     0.0172     0.0185
Scale:          74310.4782     0.0180     0.0172     0.0189
Add:            82776.3594     0.0240     0.0232     0.0249
Triad:          82598.0664     0.0239     0.0232     0.0250
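If dynamic allocation really is what turns off the non-temporal stores, one
workaround (a sketch I'd want to verify, not something out of the stock
STREAM source) is to issue them by hand with SSE2 intrinsics. That works
however the arrays were allocated, as long as they're 16-byte aligned, e.g.
from posix_memalign instead of plain malloc:

#include <emmintrin.h>  /* SSE2: _mm_load_pd, _mm_add_pd, _mm_stream_pd */

/* c[j] = a[j] + b[j] with non-temporal (streaming) stores to c.
   a, b, c must be 16-byte aligned and n must be even. */
void add_nt(const double *a, const double *b, double *c, long n)
{
    for (long j = 0; j < n; j += 2) {
        __m128d va = _mm_load_pd(&a[j]);
        __m128d vb = _mm_load_pd(&b[j]);
        _mm_stream_pd(&c[j], _mm_add_pd(va, vb));  /* bypasses the cache */
    }
    _mm_sfence();  /* make the streaming stores visible before c is read */
}

The point of the streaming store is that the destination lines aren't read
before being overwritten, which is presumably most of the gap between the
two sets of numbers above.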
> Finally, I have done a chart of performance/dollar for a wide range of
> processor variants, taking as performance both FLOPs and memory bandwidth and
> assuming equal cost of chassis and amount of memory, dividing the performance
> by the cost of the processor.
I agree that the costs of chassis, RAM, motherboard, and related parts are
very similar. But it seems odd to evaluate price/performance without using
the system (not CPU) price. The best price/perf CPU will very often be
different from the CPU in the best price/perf node.
While interesting, when making design/purchase decisions I look at
price/performance per node.
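To put made-up numbers on it: if the rest of a node costs $3000, a $700 CPU
delivering 90% of the performance of a $1500 CPU looks roughly 2X better per
CPU dollar (0.90/700 vs 1.00/1500) but only about 9% better per node dollar
(0.90/3700 vs 1.00/4500).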
> I am attaching it to this email. I took the cost of the processors from
> publicly available information on both AMD and INTEL processors. I know that
> price varies for each deal but, as fair an estimate as possible, I get that
> Perf/$ is 2X higher on AMD than on INTEL, regardless of looking at FLOP/s or
> GB/s, and comparing similar processor models (ie. 8c INTEL vs 16c AMD).
Did you intentionally ignore the current-generation AMDs? Personally
I'd find CPU2006 per $ more interesting (int or FP rate).
> You can make the chart by yourself if you know how to compute real FLOPs and
> real bandwidth.
Normally I take wall clock time on the application justifying the
purchase of the cluster, divided by the cost of a node.