[Beowulf] AMD performance (was 500GB systems)

Fri Jan 11 04:01:29 PST 2013

Hi Bill, 
AMD should pay you for these wise comments ;) 

But since this list is about providing feedback and sharing knowledge, I
would like to add something to your comments, in a somewhat HW-agnostic way.
Running the STREAM benchmark is an easy way to find out what the memory
controllers are capable of.

In practical terms, that number translates, for a wide variety of
applications, into data processing throughput and therefore into real
application performance: data is stored in RAM, fetched into caches,
processed by the cores, and returned to the caches, to be finally evicted
back to RAM while new chunks of data are brought into cache, until the whole
data set is processed.

STREAM does minimal computation (at most the triad), but looking at the
aggregate memory bandwidth it really exposes the bottleneck (in negative
terms) or the throughput (in positive terms) of the processor and of the
platform, when accounting for multiple processors connected by some type of
fabric: cHT, QPI, network.
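
For readers who have not looked inside the benchmark, here is a minimal
sketch of a STREAM-style triad and of how a bandwidth figure is derived from
a timing. It is not the official STREAM source; the function names (triad,
triad_bytes, triad_gbs) are made up for illustration, and the timing is
assumed to be measured by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* STREAM-style triad: a[i] = b[i] + scalar * c[i].
   Per element: two loads, one store, two floating-point operations. */
void triad(double *a, const double *b, const double *c,
           double scalar, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes counted for one triad pass: three arrays of n doubles. */
double triad_bytes(size_t n)
{
    return 3.0 * sizeof(double) * (double)n;
}

/* Bandwidth in GB/s (10^9 bytes/s) from the measured time of one pass. */
double triad_gbs(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return triad_bytes(n) / seconds / 1e9;
}
```

The arrays have to be much larger than the last-level cache, otherwise the
number measures the caches rather than the memory controllers.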

The main comment I would like to add concerns your STREAM bandwidth
results. Your log2 chart says that AMD delivers ~100GB/s on a 4P system and
Intel delivers ~30GB/s on a 2P system. I may be reading the chart wrong, but
it should be about 140GB/s with AMD (Interlagos/Abu Dhabi) with 1600MHz DDR3
memory, about 40GB/s with Intel (Nehalem/Westmere) with 1333MHz DDR3 memory,
and about 75GB/s with Sandy Bridge with 1600MHz DDR3 memory.
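
To put those figures in context, the sketch below computes the theoretical
peak bandwidth of a platform and the fraction of it that a measured result
reaches. The helper names are hypothetical. The socket and channel counts in
the note after the code are my assumptions and are not stated above: 4
channels per socket on the 4P AMD system, 3 per socket on 2P
Nehalem/Westmere, and 4 per socket on 2P Sandy Bridge.

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in GB/s (10^9 bytes/s).
   A DDR3 channel is 64 bits (8 bytes) wide, so each channel moves
   8 bytes per transfer: DDR3-1600 gives 12.8 GB/s per channel. */
double peak_bw_gbs(int sockets, int channels_per_socket,
                   double mega_transfers_per_s)
{
    assert(sockets > 0 && channels_per_socket > 0);
    assert(mega_transfers_per_s > 0.0);
    return sockets * channels_per_socket * 8.0
           * mega_transfers_per_s / 1000.0;
}

/* Fraction of the theoretical peak reached by a measured result. */
double bw_efficiency(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

Under those assumptions, peak_bw_gbs(4, 4, 1600) is 204.8 GB/s, so 140GB/s
would be roughly 68% of peak; peak_bw_gbs(2, 3, 1333) is about 64 GB/s, so
40GB/s would be roughly 63%; and peak_bw_gbs(2, 4, 1600) is 102.4 GB/s, so
75GB/s would be roughly 73%.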

To reach such significantly higher memory bandwidth on this specific
benchmark, and this is what I want people to realize, you have to take into
account that the data is used only once. There is a loop to repeat the
experiment and average the timings, but in terms of processing the data is
used only once and then you bring in a new chunk of data. In other words,
there is no reuse of the data in the "near term". Therefore, you want to
speed up processing by evicting the data already processed from the cache
levels closest to the core directly into RAM and bringing fresh data from
RAM into the caches, rather than evicting the recently processed data into
the caches and wasting precious space on data you don't need "for the time
being". If you bypass the normal mechanism, you increase the amount of new
data fetched into the caches while quickly storing the crunched data in RAM.

To do so, you want to use non-temporal stores, which bypass the regular
caching path and write directly to memory. Many applications behave this
way: you have to make a pass through the data and you may access it again
(e.g. in the next iteration), but only after you have processed a lot more
data (e.g. in the current iteration), which prevents the cache from keeping
that data close. Better to get rid of it and bring it in again when needed.
If you do so on applications that are not cache friendly (cache-friendly
behavior being the opposite of what I just described), you will greatly
improve their performance.
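
As an illustration of the idea, here is a minimal sketch of a triad that
writes its result with non-temporal stores, using the x86 SSE2 intrinsic
_mm_stream_pd. The function name triad_nt is made up for this example, the
destination array is assumed to be 16-byte aligned, and on targets without
SSE2 the code simply falls back to ordinary stores. In practice a compiler
may generate such stores on its own; this only shows the mechanism.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_stream_pd; also provides _mm_sfence */
#endif

/* Triad a[i] = b[i] + scalar * c[i], storing a[] with non-temporal
   stores so the results do not occupy cache space on their way to RAM.
   Assumption: a is 16-byte aligned (required by _mm_stream_pd). */
void triad_nt(double *a, const double *b, const double *c,
              double scalar, size_t n)
{
    size_t i = 0;
    assert(((uintptr_t)a % 16) == 0);
#if defined(__SSE2__)
    const __m128d s = _mm_set1_pd(scalar);
    for (; i + 2 <= n; i += 2) {
        __m128d vb = _mm_loadu_pd(b + i);
        __m128d vc = _mm_loadu_pd(c + i);
        /* Streaming store: written out through write-combining
           buffers instead of being allocated in the caches. */
        _mm_stream_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(s, vc)));
    }
    /* Non-temporal stores are weakly ordered; fence before the
       results are read by this or any other thread. */
    _mm_sfence();
#endif
    /* Scalar tail, and the whole loop when SSE2 is unavailable. */
    for (; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}
```

This only pays off when a[] is not read again soon. If the data is reused
while it would still be in cache, regular stores are the better choice.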

Finally, I have made a chart of performance/dollar for a wide range of
processor variants, taking both FLOP/s and memory bandwidth as performance,
assuming equal cost of chassis and amount of memory, and dividing the
performance by the cost of the processor.
I am attaching it to this email. I took the cost of the processors from
publicly available information on both AMD and Intel processors. I know that
the price varies for each deal, but as fair an estimate as possible gives me
a Perf/$ on AMD that is 2X that on Intel, regardless of whether you look at
FLOP/s or GB/s, when comparing similar processor models (i.e. 8c Intel vs
16c AMD).

You can make the chart yourself if you know how to compute real FLOP/s and
real bandwidth. I also did the fun exercise of halving the price of the Intel
processors (i.e. a 50% discount), and then the Intel Perf/USD lines moved to
match the AMD lines, i.e. Intel became Perf/USD competitive, or on par,
without any discount on AMD.
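
The arithmetic behind such a chart can be sketched as follows. The function
names are made up for illustration, and the numbers in the note after the
code are placeholders chosen only to show the calculation, not real prices
or measurements.

```c
#include <assert.h>

/* Performance per dollar: perf may be GFLOP/s or GB/s, as long as
   the same metric is used for every processor being compared. */
double perf_per_dollar(double perf, double price_usd)
{
    assert(price_usd > 0.0);
    return perf / price_usd;
}

/* Ratio of perf/$ of processor A over processor B.
   A value above 1 means A delivers more performance per dollar. */
double value_ratio(double perf_a, double price_a,
                   double perf_b, double price_b)
{
    return perf_per_dollar(perf_a, price_a)
         / perf_per_dollar(perf_b, price_b);
}
```

With placeholder values, value_ratio(100.0, 1000.0, 100.0, 2000.0) gives 2,
and halving the second price, value_ratio(100.0, 1000.0, 100.0, 1000.0),
gives 1. That is why a 2X gap in Perf/$ closes with a 50% discount on the
more expensive part.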

Best regards, Joshua Mora