[Beowulf] Again about NUMA (numactl and taskset)

Mon Jun 23 08:25:28 PDT 2008

> The questions are
> 1) Is there some way to distribute the local memory of threads analogously
> (I assume that it has the same size for each thread) using "reasonable" NUMA
> allocation ?

that is, not surprisingly, the default.  generally, on all NUMA machines,
the starting rule is that memory is allocated for a thread upon "first
touch".  that is, a page is placed on the node of the first thread to touch
it, because that touch causes the page fault that triggers the actual
allocation.  (if you allocate memory but never touch it, it remains purely
virtual, apart from any book-keeping by your memory allocation library.)
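
to see the "purely virtual until touched" behaviour for yourself, here is a
minimal sketch in Python, assuming Linux (it reads /proc/self/statm).  the
function names resident_bytes and first_touch_demo are made up for this
example, and it only shows when physical pages appear, not which node they
land on.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def resident_bytes():
    """Resident set size of this process, from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        return int(f.read().split()[1]) * PAGE


def first_touch_demo(size=32 * 1024 * 1024):
    """Map `size` bytes anonymously, then touch every page.

    Returns (before, mapped, touched): resident bytes before the mapping,
    after mapping but before any access, and after writing to each page.
    """
    before = resident_bytes()
    buf = mmap.mmap(
        -1,
        size,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    mapped = resident_bytes()          # still virtual: no page faults yet
    for offset in range(0, size, PAGE):
        buf[offset] = 1                # first touch: page fault, real allocation
    touched = resident_bytes()
    buf.close()
    return before, mapped, touched


if __name__ == "__main__":
    before, mapped, touched = first_touch_demo()
    print("growth after mmap :", mapped - before, "bytes")
    print("growth after touch:", touched - mapped, "bytes")
```

in a threaded program the practical consequence is that the thread which
initializes an array decides where its pages live, so it usually pays to
initialize data in the same thread that will later work on it.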

> 2) Is it right that using numactl for applications may give performance
> improvements in the following case:
> the number of application processes is equal to the number of cores of one 
> CPU *AND* the necessary (for application) RAM amount may be placed on one 
> node DIMMs (I assume that RAM is allocated "continously").

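you certainly don't want to _deliberately_ create imbalances.
"numactl --hardware" is interesting to see the state of memory allocation.
of course, it reports only each node's size and free memory (where free
means "wasted" to the kernel, i.e. not used for anything at all; it is not
the same as "freeable", since reclaimable pages such as page cache are not
counted as free.)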
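
if you want the same per-node size/free numbers from a script, the sketch
below reads them from sysfs instead of parsing numactl output.  it assumes
a Linux kernel that exposes /sys/devices/system/node/nodeN/meminfo with
lines of the form "Node 0 MemTotal: ... kB"; parse_node_meminfo and
per_node_memory are names invented for this example.

```python
import glob
import re

# Matches lines such as "Node 0 MemFree:   1234567 kB".
_LINE = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s+kB")


def parse_node_meminfo(text):
    """Return {field: kilobytes} for one node's meminfo text."""
    fields = {}
    for match in _LINE.finditer(text):
        fields[match.group(2)] = int(match.group(3))
    return fields


def per_node_memory():
    """Return {node: (MemTotal_kB, MemFree_kB)} for every node found in sysfs.

    The result is empty on systems that do not expose per-node meminfo.
    """
    result = {}
    pattern = "/sys/devices/system/node/node[0-9]*/meminfo"
    for path in sorted(glob.glob(pattern)):
        node = int(re.search(r"node(\d+)/meminfo$", path).group(1))
        with open(path) as f:
            info = parse_node_meminfo(f.read())
        result[node] = (info.get("MemTotal"), info.get("MemFree"))
    return result


if __name__ == "__main__":
    for node, (total, free) in per_node_memory().items():
        print(f"node {node}: size {total} kB, free {free} kB")
```

a large gap between nodes in the "free" column is the kind of imbalance
worth looking at, keeping in mind that low free memory may just be page
cache.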

> What will performance be (when using numactl) if the RAM size required is
> higher than the RAM available on one node, so that the program cannot make
> (load balanced) simultaneous use of the memory controllers on both CPUs ?

non-local memory is modestly slower than local - not dramatically.

> (I also assume that RAM is allocated
> continously).

I'm not sure what that means - continuously in time?  or contiguously?
the latter is definitely not true - the allocated memory map for a task
will normally be pretty chopped up, and the virtual addresses will have 
little relation to physical addresses.

> 3) Is there some reason to use things like
> mpirun -np N /usr/bin/numactl <numactl_parameters>  my_application   ?

not that I know.

> 4) If I use malloc()  and don't use numactl, how to understand - from which 
> node Linux will begin the real memory allocation ? (I remember that I assume

if there is free memory on the node where the thread is running, 
that's where the physical page will be allocated.

> that all the RAM is free) And how to understand  where are placed the DIMMs 
> which will corresponds to higher RAM addresses or lower RAM addresses ?

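I don't see why userspace would need to know that.  the main question is 
whether non-local allocations are allowed or not, and you set that policy
with numactl: --localalloc prefers the node the thread is running on,
--preferred names a node to try first, and --membind forbids allocation
outside the listed nodes.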

> 5) In which cases is it reasonable to switch on "Node memory interleaving" 
> (in BIOS) for the application which uses more memory than is presented on the 
> node ?

I leave it off, since numactl --interleave lets you get the same effect 
from user-space.  I'm not sure I've ever seen it be a win.
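
for comparing policies on the same program, a small launcher helper can
make the runs less error-prone.  the sketch below only builds the command
line for the numactl options named in this thread (--localalloc,
--preferred, --interleave); numactl_command is a made-up helper, and
"./my_application" is a placeholder.

```python
import shlex


def numactl_command(program_args, policy="localalloc", nodes=None):
    """Build an argv list that runs `program_args` under a numactl policy.

    policy: "localalloc" (no node list), or "preferred" / "interleave"
            (both need `nodes`, e.g. "0" or "all" for interleave).
    """
    if policy == "localalloc":
        options = ["--localalloc"]
    elif policy in ("preferred", "interleave"):
        if not nodes:
            raise ValueError(f"policy {policy!r} needs a node specification")
        options = [f"--{policy}={nodes}"]
    else:
        raise ValueError(f"unsupported policy: {policy!r}")
    return ["numactl", *options, *program_args]


if __name__ == "__main__":
    for policy, nodes in [("localalloc", None), ("interleave", "all")]:
        argv = numactl_command(["./my_application"], policy, nodes)
        print(shlex.join(argv))
```

the resulting list can be handed to subprocess.run() and timed, which is
an easy way to check whether interleaving helps your particular
application.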

> And BTW: if I use taskset -c CPU1,CPU2, ... <program_file>
> and the program_file creates some new processes, will all these processes
> run only on the same CPUs defined in the taskset command ?

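yes.  the CPU affinity mask that taskset sets is inherited by children
created with clone or fork, and it is preserved across exec, so new
processes stay on the same CPUs unless they change their own affinity.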
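
this is easy to check directly.  the sketch below, assuming Linux and
Python 3 (os.sched_setaffinity is Linux-only), pins the current process
much as taskset would, starts a child via fork+exec, and reports the
affinity the child sees.  child_affinity_when_pinned is a name invented
for this example.

```python
import ast
import os
import subprocess
import sys

# One-liner run in the child: print the CPUs it is allowed to run on.
_CHILD_CODE = "import os; print(sorted(os.sched_getaffinity(0)))"


def child_affinity_when_pinned(cpus):
    """Pin this process to `cpus`, spawn a child, return the child's CPU set.

    The parent's original affinity is restored afterwards.
    """
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)      # same effect as taskset on ourselves
    try:
        out = subprocess.run(
            [sys.executable, "-c", _CHILD_CODE],
            check=True,
            capture_output=True,
            text=True,
        ).stdout
    finally:
        os.sched_setaffinity(0, original)
    return set(ast.literal_eval(out.strip()))


if __name__ == "__main__":
    cpu = min(os.sched_getaffinity(0))  # pick a CPU we are allowed to use
    print("parent pinned to:", {cpu})
    print("child runs on   :", child_affinity_when_pinned({cpu}))
```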