[Beowulf] recommendations for cluster upgrades
Bill Broadley
bill at cse.ucdavis.edu
Wed May 13 22:18:42 PDT 2009
Rahul Nabar wrote:
> On Tue, May 12, 2009 at 7:05 PM, Greg Keller <Greg at keller.net> wrote:
>> Nehalem is a huge step forward for Memory Bandwidth hogs. We have one code
>> that is extremely memory bandwidth sensitive that i
>
> Thanks Greg. A somewhat naive question I suspect.
>
> What's the best way to test the "memory bandwidth sensitivity" of my
> code?
Oprofile, I think it is, can query various CPU performance counters that
record things like cache hits/misses, TLB hits/misses, and the like. If I'm
misremembering the name I'm sure someone will speak up. Of course you'd need
an idea of what a node is capable of; I suggest using micro-benchmarks that
exercise the areas of the memory hierarchy that you are interested in.
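For the bandwidth side I'd start with STREAM; just as a sketch of the idea,
a minimal triad-style loop looks something like the following (the array
size N is an arbitrary placeholder, keep it well past the last-level cache):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (32L * 1024 * 1024)  /* 32M doubles per array (~256 MB): >> cache */

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    long i;
    double t0, t1, sum = 0.0;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = wtime();
    for (i = 0; i < N; i++)             /* triad: two loads + one store each */
        a[i] = b[i] + 3.0 * c[i];
    t1 = wtime();

    for (i = 0; i < N; i++) sum += a[i];   /* keep the result live */
    printf("triad: %.0f MB/s  (checksum %g)\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e6, sum);
    return 0;
}

Compile with optimization (gcc -O2 or similar) and compare what you measure
against what the node should do in theory.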
> Given the fact that I have only the AMD Opterons available
> currently. Any active / passive tests that can throw some metrics out
> at me?
Hrm, having a single platform makes it harder. I've often begged/borrowed
accounts so I could measure application performance, then run a series of
micro-benchmarks to quantify various aspects of the memory hierarchy, and
then look for correlations between the micro-benchmark results and the
application performance.
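For latency the usual trick is a pointer chase through a randomly linked
array, so every load depends on the one before it and the prefetchers can't
hide the trip to memory. A minimal sketch (N and STEPS are just illustrative
sizes):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N     (32L * 1024 * 1024)   /* 32M longs (~256 MB): well past the caches */
#define STEPS (20L * 1000 * 1000)   /* enough dependent loads for a stable average */

int main(void)
{
    long i, j, tmp;
    double sec;
    struct timeval t0, t1;
    long *next = malloc(N * sizeof(long));
    if (!next) return 1;

    /* Sattolo's algorithm: shuffle into a single cycle so every load's
       address depends on the previous load and hardware prefetch can't help. */
    for (i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (i = N - 1; i > 0; i--) {
        j = rand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    gettimeofday(&t0, NULL);
    for (i = 0, j = 0; i < STEPS; i++)
        j = next[j];                    /* serialized, dependent loads */
    gettimeofday(&t1, NULL);

    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("avg dependent-load latency: %.1f ns  (j=%ld)\n",
           sec / STEPS * 1e9, j);
    return 0;
}

Shrinking N until the array fits in L2/L1 also gives you the cache latencies
for comparison.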
There are a couple of other things you can do:
* Run 1 to N copies of your code. Great scaling usually means you are CPU
  limited/cache friendly; poor scaling usually means contention in the memory
  and/or I/O system. (A sketch of this test follows the list.)
* If you think you are doing lots of random memory accesses you can turn on
  node/channel/bank interleaving. If performance drops when you go from
  4 x 64-bit channels of memory to 1 x 256-bit channel you are likely
  limited by the number of simultaneous outstanding memory references.
* Often you can tweak the BIOS in various ways to increase/decrease bandwidth:
  things like underclocking the memory bus, underclocking the HyperTransport
  link, more aggressive ECC scrubbing, and various other tweaks available in
  the northbridge settings.
* Newer Opterons (Shanghai and Barcelona) currently have two 64-bit channels
  per socket. If you install the DIMMs wrong you get a single 64-bit channel,
  halving the bandwidth.
* If you pull all the DIMMs on one socket you halve the node's memory
  bandwidth (in a dual-socket system).
* If you pull a CPU (usually CPU1, not CPU0, should be pulled) I believe the
  coherency traffic goes away and the latency to memory drops by 15 ns or so;
  if that makes a difference in your application runtime you are rather
  latency sensitive.
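As for the first item (1 to N copies), a shell loop around your real binary
is all you really need, but here is a self-contained sketch of the idea: fork
N copies of a memory-streaming stand-in kernel and watch how the wall time
grows with N. run_copy(), WORDS, and the pass count are just placeholders for
your actual application:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORDS (16L * 1024 * 1024)   /* ~128 MB per copy; shrink if memory is tight */

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Stand-in for the real application: repeatedly stream over a large array.
   (volatile keeps the compiler from optimizing the memory traffic away.) */
static void run_copy(void)
{
    long i, pass;
    volatile double *a = malloc(WORDS * sizeof(double));
    if (!a) _exit(1);
    memset((void *) a, 0, WORDS * sizeof(double));
    for (pass = 0; pass < 10; pass++)
        for (i = 0; i < WORDS; i++)
            a[i] = a[i] * 1.000001 + 1.0;
    free((void *) a);
}

int main(void)
{
    int n, i;
    for (n = 1; n <= 8; n++) {          /* 1..8 copies; pick N = cores per node */
        double t0 = wtime();
        for (i = 0; i < n; i++)
            if (fork() == 0) { run_copy(); _exit(0); }
        while (wait(NULL) > 0)          /* reap all the children */
            ;
        printf("%d copies: %.2f s\n", n, wtime() - t0);
    }
    return 0;
}

If the time stays roughly flat out to the core count you are mostly CPU
limited; if it climbs steeply the copies are fighting over memory (or I/O)
bandwidth.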
So basically with the above you should be able to play with parallel
outstanding requests from 1 to 4 per system, bandwidth at 25, 50, and 100% of
normal, and a bit more with the other tweaks. I recommend some micro-benchmarks
to look at the underlying memory performance, then look at application
runtimes to see how sensitive you are.