[Beowulf] AMD performance (was 500GB systems)

Vincent Diepeveen diep at xs4all.nl
Sat Jan 12 07:29:47 PST 2013

On Jan 12, 2013, at 12:15 PM, Bill Broadley wrote:

> On 01/11/2013 05:22 AM, Vincent Diepeveen wrote:
>>> Bill - a 2 socket system doesn't deliver 512GB ram.
> On 01/11/2013 05:59 AM, Reuti wrote:
>> Maybe I get it wrong, but I was checking these machines recently:
>> IBM's x3550 M4 goes up to 768 GB with 2 CPUs
>> http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03131usen/XSD03131USEN.PDF
>> IBM's x3950 X5 goes up to 3 TB with their MAX-5 extension using 4
>> CPUs, so I assume 1.5 TB with 2 CPUs could work too
>> http://public.dhe.ibm.com/common/ssi/ecm/en/xsd03054usen/XSD03054USEN.PDF
> There's plenty of others as well.  Motherboards with 16 DIMM slots that
> support 32GB DIMMs are pretty common.  Supermicro resellers often will
> sell a configuration that supports 512GB ram in a dual socket system.
> However, it's much cheaper (around half the price) to buy a quad
> socket with 512GB ram; it looks like they start at around $8k.
> I updated my memory latency benchmark, and the inner loop is:
>     while (p != 0)
>     {
>       p = a[p];
>       cnt++;
>     }
> My benchmark tests latency by:
> 1) allocating 400GB (1GB = 2^30 bytes) of 64 bit ints.
> 2) shuffling them with the Knuth shuffle, using drand48 for randomness.
> 3) visiting 1 int per cacheline (3.3B or so).
> 4) completing 3,355,443,200 reads in 363.08 seconds (108ns per hop).
> The goal being to make it impossible for prefetch or caches to make the
> main memory latency look lower than it actually is.
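
For readers who want to try the quoted benchmark, here is a minimal
sketch of how the four steps might fit together. It is not Bill's
actual code. The names build_chain and chase are hypothetical, and the
128-byte spacing is only inferred from the quoted figures (400 * 2^30
bytes / 3,355,443,200 reads = 128); the post itself just says "1 int
per cacheline". Instead of shuffling the array in place, the sketch
shuffles a visiting order and links it into a single cycle through
slot 0, so the quoted inner loop is guaranteed to touch every slot
exactly once before it terminates.

```c
#define _XOPEN_SOURCE 700   /* for srand48/drand48 */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed spacing between visited ints, inferred from the quoted numbers. */
#define SPACING_BYTES 128
#define STRIDE (SPACING_BYTES / sizeof(uint64_t))   /* elements per slot */

/* Link one int per slot into a single random cycle that passes through
   slot 0. `a` must hold nslots * STRIDE elements.
   Returns 0 on success, -1 if the temporary order array cannot be allocated. */
int build_chain(uint64_t *a, uint64_t nslots, long seed)
{
    assert(a != NULL && nslots >= 2);
    uint64_t *order = malloc((nslots - 1) * sizeof *order);
    if (order == NULL)
        return -1;
    for (uint64_t i = 0; i < nslots - 1; i++)
        order[i] = i + 1;                       /* slots 1..nslots-1 */

    srand48(seed);
    for (uint64_t i = nslots - 2; i > 0; i--) { /* Knuth (Fisher-Yates) shuffle */
        uint64_t j = (uint64_t)(drand48() * (double)(i + 1));   /* 0..i */
        uint64_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    uint64_t prev = 0;                          /* start at slot 0 */
    for (uint64_t i = 0; i < nslots - 1; i++) {
        a[prev * STRIDE] = order[i] * STRIDE;   /* store index of next hop */
        prev = order[i];
    }
    a[prev * STRIDE] = 0;                       /* last slot points back to 0 */
    free(order);
    return 0;
}

/* Follow the chain once around, using the inner loop from the post.
   Returns the number of reads performed (equal to nslots). */
uint64_t chase(const uint64_t *a)
{
    uint64_t p = a[0];
    uint64_t cnt = 1;
    while (p != 0) {
        p = a[p];
        cnt++;
    }
    return cnt;
}
```

To get a per-hop latency you would time the call to chase(), for
example with clock_gettime(CLOCK_MONOTONIC, ...), and divide the
elapsed time by the returned count. Each load depends on the previous
one, which is what keeps the prefetcher from helping.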

Yes, I invented that test of jumping around randomly using an RNG.
Paul Hsieh then modified it from calling the RNG (and correcting for
the cost of the RNG) to the direct pointer math that you show here.

This test is not so strong when using multiple cores, however. It's
only OK for 1 core.

Their test, however, did not work on all cores at the same time, and
it was 32 bits.

If you are using 32-bit ints, however, you will not be able to address
more than 16GB of RAM.
You need 64-bit ints for p.
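
The post does not say how the 16GB figure is derived. As a rough
illustration, the reach of an index is the number of distinct index
values times the element size; the helper below (a hypothetical name,
not from any library) computes that. Two readings that both give 16GB
are a signed 32-bit index (31 usable bits) over 8-byte elements, or an
unsigned 32-bit index over 4-byte elements.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes reachable through an array index that has `usable_bits` value bits
   when each element is `elem_bytes` wide. Hypothetical helper for
   illustration only. */
uint64_t max_addressable_bytes(unsigned usable_bits, uint64_t elem_bytes)
{
    assert(usable_bits < 64 && elem_bytes > 0);
    assert(elem_bytes <= (UINT64_MAX >> usable_bits));   /* result fits in 64 bits */
    return (UINT64_C(1) << usable_bits) * elem_bytes;
}
```

For example, max_addressable_bytes(31, 8) and
max_addressable_bytes(32, 4) both return 16 * 2^30. Either way a
400GB table is far out of reach, which is why p has to be a 64-bit
integer.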

You can set up the pattern much more quickly than Paul Hsieh proposed.
You can simply run around with the RNG while setting it up, since we
can prove that most reasonably good RNGs (don't use the built-in RNG)
run around the whole array in O(n log n).

The distance you jump has a large influence on the latency you will
measure.

Note that in the same manner I designed a lemma from which you can
build a heuristic for random-looking noise that you see, say, in outer
space: if you put it in a huge matrix and it fills the matrix
perfectly in O(n log n), then you are not seeing random chatter but
artificial, encrypted data.

> About 40ns of that latency is the constant TLB missing involved in
> randomly accessing 400GB.  The throughput is pretty low because you are
> leaving 15 of 16 memory channels idle at any time.
> However if it's acceptable to split the 400GB into chunks so they can
> be simultaneously read by multiple processes/threads you can do
> substantially better.  With 64 cores running flat out, doing the same
> job, the per thread latency rises to 199ns.  But since you are keeping
> all 16 channels busy you end up with a cache line lookup every 3.1 ns
> or so.
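
The two headline numbers in the quoted text follow from simple
division, sketched below with hypothetical helper names. The
single-thread figure is elapsed time over reads, and the 3.1 ns figure
is the per-thread latency divided by the number of threads that are
issuing reads concurrently.

```c
#include <assert.h>
#include <stdint.h>

/* Average latency of one dependent read, in nanoseconds. */
double ns_per_read(double elapsed_s, uint64_t reads)
{
    assert(reads > 0);
    return elapsed_s * 1e9 / (double)reads;
}

/* Average time between completed reads across all threads, assuming every
   thread sees the same per-read latency and all run concurrently. */
double aggregate_ns_per_read(double per_thread_ns, unsigned threads)
{
    assert(threads > 0);
    return per_thread_ns / (double)threads;
}
```

With the quoted inputs, ns_per_read(363.08, 3355443200) comes out at
about 108.2 ns, and aggregate_ns_per_read(199.0, 64) at about 3.11 ns.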

200 ns is very fast. Yet your test is crappy.

The glory of DDR3. Yet you write that you visit 3.3 bytes per  
cacheline. Which is less than 16GB of data.

I'm getting more bytes per jump out of every cacheline, and in my test
every core can of course read from the same cache lines as the other
cores.
So I simply split the entire RAM into 8-byte chunks.

Parallelizing the above code with direct pointer math is not OK, as
every core jumps in the same manner, so clever caches will predict you.

That's why I initialize every core's RNG with a different value and
jump according to the RNG. Later on I correct for the time it takes
to calculate the RNG value, which is under 3 ns on most architectures
anyway. It is a tad more on Itanium, as it doesn't know how to rotate,
a crucial thing for most RNGs!

I'm using the ranrot RNG to jump.
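
A minimal sketch of the idea, under stated assumptions: the generator
below is Marsaglia's xorshift64, used purely as a stand-in because the
ranrot code is not shown in this thread, and seed_for_core,
random_reads and rng_only are hypothetical names. Each core gets its
own seed, reads 8-byte entries at RNG-chosen positions, and the
RNG-only loop exists so its cost can be timed and subtracted.

```c
#include <assert.h>
#include <stdint.h>

/* xorshift64 (Marsaglia). A stand-in generator, NOT the ranrot from the post.
   The state must never be zero. */
static inline uint64_t xorshift64(uint64_t *s)
{
    uint64_t x = *s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *s = x;
}

/* Hypothetical helper: a distinct, nonzero seed for each core. An odd
   multiplier is invertible mod 2^64, so core + 1 != 0 never maps to 0. */
uint64_t seed_for_core(unsigned core)
{
    return UINT64_C(0x9E3779B97F4A7C15) * ((uint64_t)core + 1);
}

/* Read `hops` entries of a[0..n-1] at RNG-chosen positions.
   The returned sum keeps the compiler from discarding the loads. */
uint64_t random_reads(const uint64_t *a, uint64_t n, uint64_t seed, uint64_t hops)
{
    assert(a != 0 && n > 0 && seed != 0);
    uint64_t s = seed, sum = 0;
    for (uint64_t i = 0; i < hops; i++)
        sum += a[xorshift64(&s) % n];
    return sum;
}

/* The same loop without the memory access, for timing the RNG alone. */
uint64_t rng_only(uint64_t seed, uint64_t hops)
{
    assert(seed != 0);
    uint64_t s = seed, acc = 0;
    for (uint64_t i = 0; i < hops; i++)
        acc ^= xorshift64(&s);
    return acc;
}
```

Timing both loops over the same number of hops and taking
(t_reads - t_rng) / hops gives the corrected per-read figure. Note one
difference from the pointer-chasing loop: here the next address does
not depend on the value just loaded, so the hardware may overlap
several misses.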

Another issue is how to measure. I'm not measuring the start and stop
of every core. I let them run for a while, then I take a long
interval of 30 seconds or so to measure, and I have all cores run for
another 30 seconds while the measurement goes on, so that a bunch of
cores cannot still be measuring while the other cores have already
finished.
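
One way this measurement scheme might look with POSIX threads is
sketched below. It is an assumption-laden illustration, not the actual
test program: measure_window and worker_t are hypothetical names, the
RNG is again xorshift64 standing in for ranrot, and the RNG cost is not
subtracted. Workers run until told to stop, and the main thread samples
their counters at the start and end of a window during which every
thread is known to be running.

```c
#define _XOPEN_SOURCE 700   /* for clock_gettime/nanosleep */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    const uint64_t *a;        /* shared table, read by every thread */
    uint64_t n, seed;
    _Atomic uint64_t reads;   /* running total, sampled by the main thread */
    uint64_t sink;            /* keeps the loads from being optimized away */
    atomic_int *stop;
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = arg;
    uint64_t s = w->seed, sum = 0;
    while (!atomic_load(w->stop)) {
        for (int i = 0; i < 1024; i++) {
            s ^= s << 13; s ^= s >> 7; s ^= s << 17;   /* xorshift64 stand-in */
            sum += w->a[s % w->n];
        }
        atomic_fetch_add(&w->reads, 1024);             /* publish in batches */
    }
    w->sink = sum;
    return NULL;
}

static double now_s(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return (double)t.tv_sec + (double)t.tv_nsec * 1e-9;
}

static void sleep_s(double s)
{
    struct timespec t;
    t.tv_sec = (time_t)s;
    t.tv_nsec = (long)((s - (double)t.tv_sec) * 1e9);
    nanosleep(&t, NULL);
}

static uint64_t total_reads(worker_t *w, int nthreads)
{
    uint64_t sum = 0;
    for (int i = 0; i < nthreads; i++)
        sum += atomic_load(&w[i].reads);
    return sum;
}

/* Average ns per read per thread over the window, or -1.0 on error. */
double measure_window(const uint64_t *a, uint64_t n, int nthreads,
                      double settle_s, double window_s)
{
    assert(a != NULL && n > 0 && nthreads > 0);
    atomic_int stop = 0;
    worker_t *w = calloc((size_t)nthreads, sizeof *w);
    pthread_t *t = calloc((size_t)nthreads, sizeof *t);
    int started = 0;
    double result = -1.0;
    if (w != NULL && t != NULL) {
        for (; started < nthreads; started++) {
            w[started].a = a;
            w[started].n = n;
            w[started].seed = UINT64_C(0x9E3779B97F4A7C15) * (uint64_t)(started + 1);
            w[started].stop = &stop;
            if (pthread_create(&t[started], NULL, worker, &w[started]) != 0)
                break;
        }
        if (started == nthreads) {
            sleep_s(settle_s);                  /* let every core get going */
            uint64_t r0 = total_reads(w, nthreads);
            double t0 = now_s();
            sleep_s(window_s);                  /* the measured interval */
            uint64_t r1 = total_reads(w, nthreads);
            double t1 = now_s();
            sleep_s(settle_s);                  /* nobody stops at the window edge */
            if (r1 > r0)
                result = (t1 - t0) * 1e9 * nthreads / (double)(r1 - r0);
        }
        atomic_store(&stop, 1);
        for (int i = 0; i < started; i++)
            pthread_join(t[i], NULL);
    }
    free(w);
    free(t);
    return result;
}
```

A call in the spirit of the description above would be
measure_window(table, n, ncores, 30.0, 30.0). Because no thread starts
or stops inside the window, a core that finishes early cannot make the
others look faster than they are. Build with your compiler's pthread
option (for example -pthread with gcc).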

Cheating not allowed!

> Try that with a RAID of SSDs ;-).