[Beowulf] recommendations for cluster upgrades
Bill Broadley
bill at cse.ucdavis.edu
Wed May 13 22:18:42 PDT 2009
Rahul Nabar wrote:
> On Tue, May 12, 2009 at 7:05 PM, Greg Keller <Greg at keller.net> wrote:
>> Nehalem is a huge step forward for Memory Bandwidth hogs. We have one code
>> that is extremely memory bandwidth sensitive that i
>
> Thanks Greg. A somewhat naive question I suspect.
>
> What's the best way to test the "memory bandwidth sensitivity" of my
> code?
Oprofile, I think it is, can query various CPU performance counters that
record things like cache hits/misses, TLB hits/misses, and the like. If I'm
misremembering the name I'm sure someone will speak up. Of course you'd need
an idea of what a node is capable of; I suggest using micro-benchmarks that
exercise the areas of the memory hierarchy that you are interested in.
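For the bandwidth side I'd start with STREAM; just as a sketch of the idea,
a minimal triad-style loop looks something like the following (the array
size N is an arbitrary placeholder, keep it well past the last-level cache):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (32L * 1024 * 1024)  /* 32M doubles per array (~256 MB): >> cache */

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    long i;
    double t0, t1, sum = 0.0;
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    t0 = wtime();
    for (i = 0; i < N; i++)             /* triad: two loads + one store each */
        a[i] = b[i] + 3.0 * c[i];
    t1 = wtime();

    for (i = 0; i < N; i++) sum += a[i];   /* keep the result live */
    printf("triad: %.0f MB/s  (checksum %g)\n",
           3.0 * N * sizeof(double) / (t1 - t0) / 1e6, sum);
    return 0;
}

Compile with optimization (gcc -O2 or similar) and compare what you measure
against what the node should do in theory.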
> Given the fact that I have only the AMD Opterons available
> currently. Any active / passive tests that can throw some metrics out
> at me?
Hrm, having a single platform makes it harder. I've often begged/borrowed
accounts so I could measure application performance, then run a series of
micro-benchmarks to quantify various aspects of the memory hierarchy, and
then look for correlations between the micro-benchmark results and the
application performance.
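For latency the usual trick is a pointer chase through a randomly linked
array, so every load depends on the one before it and the prefetchers can't
hide the trip to memory. A minimal sketch (N and STEPS are just illustrative
sizes):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N     (32L * 1024 * 1024)   /* 32M longs (~256 MB): well past the caches */
#define STEPS (20L * 1000 * 1000)   /* enough dependent loads for a stable average */

int main(void)
{
    long i, j, tmp;
    double sec;
    struct timeval t0, t1;
    long *next = malloc(N * sizeof(long));
    if (!next) return 1;

    /* Sattolo's algorithm: shuffle into a single cycle so every load's
       address depends on the previous load and hardware prefetch can't help. */
    for (i = 0; i < N; i++) next[i] = i;
    srand(1);
    for (i = N - 1; i > 0; i--) {
        j = rand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    gettimeofday(&t0, NULL);
    for (i = 0, j = 0; i < STEPS; i++)
        j = next[j];                    /* serialized, dependent loads */
    gettimeofday(&t1, NULL);

    sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    printf("avg dependent-load latency: %.1f ns  (j=%ld)\n",
           sec / STEPS * 1e9, j);
    return 0;
}

Shrinking N until the array fits in L2/L1 also gives you the cache latencies
for comparison.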
There are a couple of other things you can do:
* Run 1 to N copies of your code. Great scaling usually means you are CPU
  limited/cache friendly; poor scaling usually means contention in the memory
  and/or I/O system. (A sketch of this test follows the list.)
* If you think you are doing lots of random memory accesses you can turn on
  node/channel/bank interleaving. If performance drops when you go from
  4 x 64-bit channels of memory to 1 x 256-bit channel you are likely
  limited by the number of simultaneous outstanding memory references.
* Often you can tweak the BIOS in various ways to increase/decrease bandwidth:
  things like underclocking the memory bus, underclocking the HyperTransport
  link, more aggressive ECC scrubbing, and various other tweaks available in
  the northbridge settings.
* Newer Opterons (Shanghai and Barcelona) currently have two 64-bit channels
  per socket. If you install the DIMMs wrong you get a single 64-bit channel,
  halving the bandwidth.
* If you pull all the DIMMs on one socket you halve the node's memory
  bandwidth (in a dual-socket system).
* If you pull a CPU (usually CPU1, not CPU0, should be pulled) I believe the
  coherency traffic goes away and the latency to memory drops by 15 ns or so;
  if that makes a difference in your application runtime you are rather
  latency sensitive.
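As for the first item (1 to N copies), a shell loop around your real binary
is all you really need, but here is a self-contained sketch of the idea: fork
N copies of a memory-streaming stand-in kernel and watch how the wall time
grows with N. run_copy(), WORDS, and the pass count are just placeholders for
your actual application:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORDS (16L * 1024 * 1024)   /* ~128 MB per copy; shrink if memory is tight */

static double wtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Stand-in for the real application: repeatedly stream over a large array.
   (volatile keeps the compiler from optimizing the memory traffic away.) */
static void run_copy(void)
{
    long i, pass;
    volatile double *a = malloc(WORDS * sizeof(double));
    if (!a) _exit(1);
    memset((void *) a, 0, WORDS * sizeof(double));
    for (pass = 0; pass < 10; pass++)
        for (i = 0; i < WORDS; i++)
            a[i] = a[i] * 1.000001 + 1.0;
    free((void *) a);
}

int main(void)
{
    int n, i;
    for (n = 1; n <= 8; n++) {          /* 1..8 copies; pick N = cores per node */
        double t0 = wtime();
        for (i = 0; i < n; i++)
            if (fork() == 0) { run_copy(); _exit(0); }
        while (wait(NULL) > 0)          /* reap all the children */
            ;
        printf("%d copies: %.2f s\n", n, wtime() - t0);
    }
    return 0;
}

If the time stays roughly flat out to the core count you are mostly CPU
limited; if it climbs steeply the copies are fighting over memory (or I/O)
bandwidth.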
So basically with the above you should be able to play with parallel
outstanding requests from 1 to 4 per system, bandwidth at 25, 50, and 100% of
normal, and a bit more with the other tweaks. I recommend some micro-benchmarks
to look at the underlying memory performance, then look at application
runtimes to see how sensitive you are.