[Beowulf] Which Xeon supports NUMA?

Tue Mar 18 14:00:10 PDT 2008

> given that many core Xeons (especially quad and/or many socket systems) have 
> some memory speed issues. With NUMA the kernel seems to be able to optimize 
> this somehow.

I don't believe so.  Intel currently still uses a single memory controller (MCH),
which means that memory access is, in the NUMA sense, uniform.  I don't
believe that Intel's recent use of multiple socket-MCH links, or of multiple
independent FBDIMM channels off the MCH, changes this.

here's an Intel Itanium 2 (It2) chipset:
http://www.intel.com/products/i/chipsets/e8870sp/e8870_blkdiag_8way_800.jpg
you can see that there are two FSBs with 4 cpus each.  a CPU on the left
will have non-uniform access to a memory bank which happens to be on the 
right side of the system.  I don't believe any of the Intel x86 chipsets 
provide this kind of design, though several other companies have done 
NUMA x86 chipsets (IBM for one).

the interesting thing is that Intel has decided to embrace the NUMA-oriented
system architecture of AMD (et al).  it'll be very interesting to see how
this plays out with Nehalem/QPI.  obviously, AMD really, really needs to wake
up and try a little harder to compete...

> (2) More importantly, has someone measured (how?) if this improves 
> performance?

usually, tuning for NUMA just means trying to keep a process near its memory.
in the chipset above, if a proc starts on the left half, make an effort to 
allocate its memory on the left as well, and keep scheduling it on left cpus.
the kernel does contain code that tries to understand this topology - the
most common machines that use it are multi-socket opteron boxes.  but systems
like SGI Altix depend on this sort of thing quite heavily.
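
to make "keep a process near its memory" concrete, here is a minimal sketch
in Python, not anything the kernel or numactl actually runs.  it assumes a
Linux box that exposes per-node CPU lists under /sys/devices/system/node/,
and the helper names (parse_cpulist, node_cpus, pin_to_node) are made up for
illustration.  it only restricts where the process runs; it relies on the
kernel's default policy of allocating memory on the node where a page is
first touched, rather than binding memory explicitly the way
numactl --membind does.

```python
import os

def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty string: node with no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def node_cpus(node):
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        return parse_cpulist(f.read())

def pin_to_node(node):
    """Restrict the calling process (pid 0 = self) to the CPUs of one node."""
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)
    return cpus
```

calling pin_to_node(0) before allocating and touching the working set is
roughly the CPU half of --cpubind=0; on a machine with a single node it
changes nothing.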

following is a trivial measurement of the effect.  I'm running the stream
benchmark on a single thread.  in the first case, I force the process and 
memory to be on the same socket.  in the second, the memory stays on node 0
but the process runs on the "wrong" socket.

[hahn@rb17 ~]$ numactl --membind=0 --cpubind=0 ./s
...
  The total memory requirement is 1144 MB
  You are running each test  11 times
...
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:       5298.8324      0.1515      0.1510      0.1520
Scale:      5334.1523      0.1504      0.1500      0.1510
Add:        5455.4020      0.2200      0.2200      0.2200
Triad:      5455.3902      0.2200      0.2200      0.2200
...
[hahn@rb17 ~]$ numactl --membind=0 --cpubind=1 ./s
...
Function     Rate (MB/s)  Avg time   Min time  Max time
Copy:       3556.1072      0.2253      0.2250      0.2260
Scale:      3620.4688      0.2213      0.2210      0.2220
Add:        3647.9716      0.3305      0.3289      0.3310
Triad:      3659.0890      0.3305      0.3280      0.3310
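
to put a number on the difference, the short Python sketch below divides the
remote rates by the local ones.  the figures are copied from the two runs
above; the function name remote_fraction is just an illustrative choice.

```python
# Rates in MB/s, copied from the two numactl runs above.
local = {"Copy": 5298.8324, "Scale": 5334.1523, "Add": 5455.4020, "Triad": 5455.3902}
remote = {"Copy": 3556.1072, "Scale": 3620.4688, "Add": 3647.9716, "Triad": 3659.0890}

def remote_fraction(local_rates, remote_rates):
    """Return remote bandwidth as a fraction of local bandwidth, per kernel."""
    return {name: remote_rates[name] / local_rates[name] for name in local_rates}

fractions = remote_fraction(local, remote)
for name, frac in fractions.items():
    print(f"{name:6s} remote/local = {frac:.2f}")
```

all four kernels come out at roughly two thirds, i.e. this single thread
loses about a third of its memory bandwidth when it runs on the other socket.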

note that NUMA optimizations are a wonderful thing, but hardly a panacea.
for instance, a busy system might not be able to put all of a proc's memory
on a particular node.  or perhaps the cpus of that node are busy.  and then
think about multithreaded programs, whose threads may run on several nodes
while sharing the same memory.  on top of that, consider caches, which
these days are variously per-core, per-chip and per-socket.

> Thanks for a brief answer

oh, sorry ;)