[Beowulf] Which Xeon supports NUMA?
Mark Hahn
hahn at mcmaster.ca
Tue Mar 18 14:00:10 PDT 2008
> given that many core Xeons (especially quad and/or many socket systems) have
> some memory speed issues. With NUMA the kernel seems to be able to optimize
> this somehow.
I don't believe so. Intel currently still uses a single memory controller hub (MCH),
which means that memory access is, in the NUMA sense, uniform. I don't
believe that Intel's recent use of multiple socket-MCH links, or of multiple
independent FBDIMM channels off the MCH, changes this.
here's an Intel Itanium 2 chipset:
http://www.intel.com/products/i/chipsets/e8870sp/e8870_blkdiag_8way_800.jpg
you can see that there are two FSBs with 4 CPUs each. a CPU on the left
will have non-uniform access to a memory bank that happens to be on the
right side of the system. I don't believe any of Intel's x86 chipsets
provide this kind of design, though several other companies have built
NUMA x86 chipsets (IBM, for one).
the interesting thing is that Intel has decided to embrace the NUMA-oriented
system architecture of AMD (et al). it'll be very interesting to see how
this plays out with Nehalem/QPI. obviously, AMD really, really needs to wake
up and try a little harder to compete...
> (2) More importantly, has someone measured (how?) if this improves
> performance?
usually, tuning for NUMA just means trying to keep a process near its memory:
in the chipset above, if a proc starts on the left half, make an effort to
allocate its memory on the left as well, and keep scheduling it on the
left-hand CPUs. the kernel does contain code that tries to understand this
topology - the most common machines that use it are multi-socket Opteron
boxes, but systems like the SGI Altix depend on this sort of thing quite heavily.
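a program can also ask for this binding itself via libnuma. here's a minimal
sketch, assuming libnuma and its numa.h header are installed; the node number
and the 64 MB buffer size are arbitrary choices for illustration:

#include <numa.h>      /* libnuma - build with: gcc bind.c -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "kernel lacks NUMA support\n");
        return 1;
    }
    int node = 0;                      /* arbitrary: like --cpubind=0 --membind=0 */
    numa_run_on_node(node);            /* schedule only on node 0's cpus */
    size_t sz = 64UL << 20;            /* arbitrary 64 MB */
    char *buf = numa_alloc_onnode(sz, node);  /* pages placed on node 0 */
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 0, sz);               /* touch the pages so they're really allocated */
    /* ... do the bandwidth-sensitive work here ... */
    numa_free(buf, sz);
    return 0;
}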
following is a trivial measurement of the effect. I'm running the stream
benchmark on a single thread: in the first case, I force the process and its
memory onto the same socket, then onto the "wrong" socket.
[hahn at rb17 ~]$ numactl --membind=0 --cpubind=0 ./s
...
The total memory requirement is 1144 MB
You are running each test 11 times
...
Function       Rate (MB/s)   Avg time   Min time   Max time
Copy:            5298.8324     0.1515     0.1510     0.1520
Scale:           5334.1523     0.1504     0.1500     0.1510
Add:             5455.4020     0.2200     0.2200     0.2200
Triad:           5455.3902     0.2200     0.2200     0.2200
...
[hahn at rb17 ~]$ numactl --membind=0 --cpubind=1 ./s
...
Function       Rate (MB/s)   Avg time   Min time   Max time
Copy:            3556.1072     0.2253     0.2250     0.2260
Scale:           3620.4688     0.2213     0.2210     0.2220
Add:             3647.9716     0.3305     0.3289     0.3310
Triad:           3659.0890     0.3305     0.3280     0.3310
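so on this box, landing on the wrong socket costs about a third of the
bandwidth: triad drops from 5455 to 3659 MB/s (3659/5455 ≈ 0.67), and the
other three kernels show the same ratio.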
note that NUMA optimizations are a wonderful thing, but hardly a panacea.
for instance, a busy system might not be able to put all of a process's
memory on a particular node, or perhaps the CPUs of that node are busy.
then think about multithreaded programs. on top of that, consider caches,
which these days are variously per-core, per-chip, and per-socket.
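for the multithreaded case, the usual trick - and this is just a sketch,
assuming OpenMP and Linux's default first-touch page placement, with a
made-up array size - is to have each thread initialize the slice of the
array it will later work on, so the pages land on that thread's node:

#include <stdlib.h>

#define N (50L * 1000 * 1000)   /* made-up size; build with: gcc -fopenmp touch.c */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    if (!a)
        return 1;

    /* first touch: the kernel places each page on the node of the cpu
       that first writes it, so a static schedule spreads the pages
       across nodes to match the threads */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* the real work must then reuse the same static schedule (and the
       threads must not migrate) for the accesses to stay node-local */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * a[i];

    free(a);
    return 0;
}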
> Thanks for a brief answer
oh, sorry ;)