[Beowulf] AMD performance (was 500GB systems)
Vincent Diepeveen
diep at xs4all.nl
Fri Jan 11 06:13:00 PST 2013
On Jan 11, 2013, at 2:59 PM, Reuti wrote:
> On 11.01.2013 at 14:22, Vincent Diepeveen wrote:
>
>> On Jan 11, 2013, at 6:03 AM, Bill Broadley wrote:
>>
>>>
>>> Over the last few months I've been hearing quite a few negative
>>> comments about AMD. Seems like most of them are extrapolating from
>>> desktop performance.
>>>
>>> Keep in mind that it's quite a stretch going from a desktop (single
>>> socket, 2 memory channels) to a server (dual socket, 4x the cores, 8
>>> memory channels).
>>>
>>
>> Bill - a 2 socket system doesn't deliver 512GB ram.
>
> Maybe I get it wrong, but I was checking these machines recently:
>
> IBM's x3550 M4 goes up to 768 GB with 2 CPUs
> http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03131usen/XSD03131USEN.PDF
Shops selling it say it has a max of 384GB ram.
Gonna be expensive DIMMs btw.
See:
http://www.comcom.nl/p/ibm/default_product/7915d2g/x3650_m4_xeon_6c_e5_2630_95w/?=&channel_code=70&product_code=44985452&utm_source=adwords-generiek&gclid=CI3K4sO84LQCFQRc3godZgwA7Q
>
> IBM's x3950 X5 goes up to 3 TB with their MAX-5 extension using 4
> CPUs, so I assume 1.5 TB with 2 CPUs could work too
> http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03054usen/XSD03054USEN.PDF
$200k a box?
Shops here don't offer it; IBM does.
It starts at $120k, and you only get 128 GB RAM for that.
Let's say we multiply that by 4 to get 512GB RAM.
http://www-304.ibm.com/shop/americas/webapp/wcs/stores/servlet/default/ProductDisplay?productId=4611686018426177038&storeId=1&langId=-1&categoryId=4611686018425279711&dualCurrId=73&catalogId=-840
>
> -- Reuti
>
>
>> Your comparison in the 2-socket domain doesn't make sense for someone
>> who needs 512GB RAM;
>> the performance of 4-socket systems is totally different from 2-socket.
>>
>> [snip]
>>>
>>> I figured I'd add a few comments:
>>> * Latency for a quad socket AMD is around 64ns to a random piece
>>> of memory (not 600ns as recently mentioned).
>>
>> I wrote a test program for this in 2003.
>>
>> You obviously have no idea what TLB-thrashing accesses look like in
>> the hundreds-of-gigabytes range.
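
To put a rough number on the TLB argument, here is a minimal sketch of the arithmetic. The 4 KiB page size and the 1024-entry TLB are assumptions for illustration only, not figures from this thread; real entry counts and page sizes vary per CPU and OS setup (huge pages change the picture considerably).

```python
# Rough TLB-reach arithmetic for uniformly random accesses.
# ASSUMPTIONS (not from the thread): 4 KiB pages and a hypothetical
# 1024-entry last-level TLB; real entry counts differ per CPU.

PAGE_BYTES = 4 * 1024          # assumed page size
TLB_ENTRIES = 1024             # hypothetical TLB size
WORKING_SET = 512 * 1024**3    # 512 GiB, the size discussed in the thread


def tlb_reach(entries, page_bytes):
    """Bytes of memory the TLB can map at one time."""
    return entries * page_bytes


def tlb_hit_fraction(entries, page_bytes, working_set):
    """Chance that a uniformly random address falls on a page that is
    currently mapped by a full TLB (steady state, uniform accesses)."""
    return min(1.0, tlb_reach(entries, page_bytes) / working_set)


if __name__ == "__main__":
    reach = tlb_reach(TLB_ENTRIES, PAGE_BYTES)
    hit = tlb_hit_fraction(TLB_ENTRIES, PAGE_BYTES, WORKING_SET)
    print(f"TLB reach: {reach // 1024**2} MiB")
    print(f"pages in working set: {WORKING_SET // PAGE_BYTES}")
    print(f"expected TLB hit fraction: {hit:.2e}")
```

Under those assumed numbers the TLB covers 4 MiB out of 512 GiB, so practically every random read also needs a page-table walk on top of the DRAM access.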
>>
>> There are no cheap systems on the planet where you can get a bunch of
>> random bytes in 64 ns
>> from a random spot in 500GB of RAM, from a memory line you hadn't
>> opened before and
>> which certainly isn't in the cache yet. You will be looking at
>> 400+ ns latencies in the best case.
>>
>> You won't get it faster on any platform that is affordable (of
>> course 512GB of SRAM would be faster,
>> yet let's not go into theoretical discussions here - as you can't
>> afford 512GB of SRAM).
>>
>>> * AMD quad sockets with 512GB ram start around $9k ($USA)
>>
>> You can easily build one with new components from eBay for $2k. Then
>> add the price of 512GB RAM to that.
>> New from a shop the AMD stuff is dirt cheap as well, since a single
>> core of the new Bulldozer line isn't fast of course;
>> offers for a fully assembled, ready-to-run system are around the $6k
>> mark - excluding the 512GB RAM of course.
>>
>> Yet it has better latency to a 512 GB block of RAM than Intel's 4
>> socket systems.
>>
>> And that will be many many hundreds of nanoseconds of course.
>>
>>> * With OpenMP, pthreads, MPI or other parallel friendly code a quad
>>> socket AMD can look up a random cache line approximately every
>>> 2.25ns.
>>> (64 threads banging on 16 memory channels at once).
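
A back-of-the-envelope sketch of how an aggregate "one cache line every N ns" figure relates to the latency a single thread sees. It assumes each of the 64 threads keeps exactly one miss in flight; that is an assumption for illustration, not something stated in either post, and the function names are made up here.

```python
# Throughput vs. latency when several threads each wait on one miss.
# ASSUMPTION: every thread has exactly one outstanding random read,
# so aggregate interval = per-access latency / number of threads.


def per_access_latency_ns(aggregate_interval_ns, threads):
    """Latency one thread observes, given the system-wide interval."""
    return aggregate_interval_ns * threads


def aggregate_interval_ns(latency_ns, threads):
    """System-wide interval between completed reads, given the latency."""
    return latency_ns / threads


if __name__ == "__main__":
    # 2.25 ns per cache line across 64 threads, as quoted above
    print(per_access_latency_ns(2.25, 64), "ns seen by each thread")
    # a 400 ns random read, spread over the same 64 threads
    print(aggregate_interval_ns(400.0, 64), "ns between completed reads")
```

The point of the sketch is only that a throughput number for many threads and a latency number for one dependent read measure different things, so the two cannot be compared directly.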
>>
>> You still don't get the picture of TLB-thrashing software, huh?
>>
>> Each read goes to a random memory location. Only at the end of
>> the calculation does the search space converge a tad,
>> but it's still random.
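
For readers who have not seen this access pattern: a minimal sketch of a dependent random-read loop (pointer chasing over a single-cycle permutation built with Sattolo's algorithm). This is an illustration only, not the 2003 test program mentioned above, and all names in it are made up. In Python the interpreter overhead swamps the DRAM latency, so the printed number is not comparable to the figures in this thread; the same loop in C over a buffer far larger than the caches and the TLB reach is what such benchmarks typically time.

```python
import random
import time


def single_cycle_permutation(n, rng):
    """Sattolo's algorithm: a random permutation forming one n-cycle."""
    perm = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: forces a single cycle
        perm[i], perm[j] = perm[j], perm[i]
    return perm


def chase(perm, steps, start=0):
    """Follow the chain; each index depends on the previous read."""
    idx = start
    for _ in range(steps):
        idx = perm[idx]
    return idx


def ns_per_read(n, steps, seed=1):
    """Average wall-clock time per dependent read, in nanoseconds."""
    perm = single_cycle_permutation(n, random.Random(seed))
    t0 = time.perf_counter()
    chase(perm, steps)
    return (time.perf_counter() - t0) * 1e9 / steps


if __name__ == "__main__":
    print(f"{ns_per_read(1 << 20, 1 << 20):.1f} ns per dependent read")
```

The single cycle matters: every element is visited once before the chain repeats, so no read can be satisfied early by a short loop that fits in cache, and the dependency on the previous read keeps the hardware from overlapping the misses.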
>>
>> A measurement I have from a slightly older 8-socket Intel box here is
>> 700 ns for similar TLB-thrashing behaviour.
>>
>>> * I've seen no problems with the AMD memory system, in general
>>> the 2k pin/4 memory bus AMD sockets seem to perform similarly
>>> to Intel.
>>
>> For random accesses on a single socket or 2 sockets there are huge
>> differences (all cores busy).
>>
>> Intel single socket is around 90 ns for my benchmark and Bulldozer
>> single socket around 150-170 ns (8 cores busy).
>>
>> You really have no idea what 'random' reads are.
>>
>>>
>>> An example of AMD's bandwidth scaling on a quad socket with 64
>>> cores:
>>> http://cse.ucdavis.edu/bill/pstream/bm3-all.png
>>>
>>> I don't have a similar Intel, but I do have a dual socket e5:
>>> http://cse.ucdavis.edu/bill/pstream/e5-2609.png
>>>
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>> Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>