[Beowulf] NUMA info request

Wed Mar 26 02:33:08 PDT 2008

On Tue, Mar 25, 2008 at 12:40 PM,  <kyron at neuralbs.com> wrote:
>
> > On Tue, Mar 25, 2008 at 12:17 AM, Eric Thibodeau <kyron at neuralbs.com>
>  > wrote:
>  >>
>  >>  Mark Hahn wrote:
>  >>  >>   NUMA is an acronym meaning Non Uniform Memory Access. This is a
>  >>  >> hardware constraint and is not a "performance" switch you turn on.
>  >>  >> Under the Linux
>  >>  >
>  >>  > I don't agree.  NUMA is indeed a description of hardware.  I'm not
>  >>  > sure what you meant by "constraint" - NUMA is not some kind of
>  >>  > shortcoming.
>  >>  Mark is right, my choice of words is misleading. By constraint I meant
>  >>  that you have to be conscious of what ends up where (that was the point
>  >>  of the link I added in my e-mail ;P )
>  >>
>  >>  >> kernel there is an option that is meant to tell the kernel to be
>  >>  >> conscious about that hardware fact and attempt to help it optimize
>  >>  >> the way it maps the memory allocation to a task Vs the processor the
>  >>  >> given task will be using (processor affinity, check out taskset (in
>  >>  >> recent util-linux implementations, ie: 2.13+)).
>  >>  > the kernel has had various forms of NUMA and socket affinity for a
>  >>  > long time,
>  >>  > and I suspect most any distro will install kernel which has the
>  >>  > appropriate support (surely any x86_64 kernel would have NUMA
>  >>  > support).
>  >>  My point of view on distro kernels is that they are to be scrutinized
>  >>  unless they are specifically meant to be used as computation nodes (ie:
>  >>  don't expect CONFIG_HZ=100 to be set on "typical" distros).
>  >>  Also, NUMA is only applicable to Opteron architecture (internal MMU
>  >>  with HyperTransport), not the Intel flavor of multi-core CPUs
>  >>  (external MMU, which can be a single bus or any memory access scheme
>  >>  as dictated by the motherboard manufacturer).
>  >>
>  >> >
>  >>  > I usually use numactl rather than taskset.  I'm not sure of the
>  >>  > history of those tools.  as far as I can tell, taskset only addresses
>  >>  > numactl --cpubind,
>  >>  > though they obviously approach things differently.  if you're
>  >>  > going to use taskset, you'll want to set cpu affinity to multiple
>  >>  > cpus (those local to a socket, or 'node' in numactl terms.)
>  >>  >
>  >>  >>   In your specific case, you would have 4Gigs per CPU and would want
>  >>  >> to make sure each task (assuming one per CPU) stays on the same CPU
>  >>  >> all the time and would want to make sure each task fits within the
>  >>  >> "local" 4Gig.
>  >>  >
>  >>  > "numactl --localalloc".
>  >>  >
>  >>  > but you should first verify that your machines actually do have the
>  >>  > 8GB split across both nodes.  it's not that uncommon to see an
>  >>  > inexperienced assembler fill up one node before going onto the next,
>  >>  > and there have even been some boards which provided no memory to the
>  >>  > second node.
>  >>  Mark (Hahn) is right (again !), I ASSumed the tech would load the
>  >>  memory banks appropriately, don't make that mistake ;) And numactl is
>  >>  indeed more appropriate in this case (thanks Mr. Hahn ;) ). Note that
>  >>  the kernel (configured with NUMA) _will_ attempt to allocate the
>  >>  memory to "local nodes" before offloading to memory "abroad".
>  >>
>  >>  Eric
>  >>
>  > The memory will be installed by myself correctly - that is,
>  > distributing the memory according to cpu.  However, it appears that
>  > one of my nodes (my first Opteron machine) may well be one that has
>  > only one bank of four DIMM slots assigned to cpu 0 and shared by cpu
>  > 1.  It uses a Tyan K8W Tiger s2875 motherboard.  My other two nodes
>  > use Arima HDAMA motherboards with SATA support - each cpu has a bank
>  > of 4 DIMMs associated with it.  The Tyan node is getting 4 @ 2 GB
>  > DIMMs, one of the HDAMA nodes is getting 8 @ 1 GB (both instances
>  > fully populating the available DIMM slots) and the last machine is
>  > going to get 4 @ 1 GB DIMMs for one cpu and 2 @ 2 GB for the other.
>
>  That last scheme might give you some unbalanced performance but that is
>  something to look up with the MB's instruction manual (ie: you might be
>  better off installing the RAM as 1G+1G+2G for both CPUs instead of 4x1G +
>  2x2G).

On my Opteron systems, wouldn't 3 DIMMs per CPU drop me into 64-bit
memory bandwidth rather than the allowed 128-bit memory bandwidth when
each CPU has an even number of DIMMs?
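To put rough numbers on what is at stake in the 64-bit versus 128-bit
question, here is a small sketch of the theoretical peak bandwidth
arithmetic. The DDR400 transfer rate is my assumption for illustration
(the thread does not state the memory speed), and whether three DIMMs
really force 64-bit mode depends on the board and its manual.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_sec):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e9


# Assumption: DDR400 memory, i.e. 400 million transfers per second.
DDR400_TRANSFERS = 400e6

narrow = peak_bandwidth_gb_s(64, DDR400_TRANSFERS)
wide = peak_bandwidth_gb_s(128, DDR400_TRANSFERS)
print(f"64-bit: {narrow:.1f} GB/s, 128-bit: {wide:.1f} GB/s")
```

These are peak figures only; what a real calculation sees will be lower
and depends on its access pattern.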

>
>
>  > It looks like I may want to upgrade my motherboard before exploring
>  > NUMA / affinity then.
>
>  If you're getting into "upgrading" (ie: throwing money at) anything, then
>  you're getting into the slippery slope of the hardware selection debate ;)

Slippery indeed.  At this point, I think I may just install the RAM to
bring my current calculation out of swap and be done with the cluster
for now.  Given that I think one of my nodes uses hypertransport for
all of cpu 1 memory access, would it hurt anything to use affinity
when only 2 out of 3 nodes can benefit from affinity?
>
>
>  > This discussion as well as reading about NUMA and affinity elsewhere
>  > leads to another question - what is the difference between using
>  > numactl or using the affinity options of my parallelization software
>  > (in my case openmpi)?
>
>  numactl is an application to help nudge processes in the correct
>  direction. Implementing CPU affinity within your code makes your code
>  explicitly aware that it will run on an SMP machine (ie: it's hardcoded
>  and you don't need to call a script to change your process's affinity).
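As a minimal sketch of what "affinity within your code" can look like,
the snippet below pins the calling process from inside the program
instead of wrapping it with numactl or taskset. It assumes Linux and
Python 3.3 or later; pin_to_cpus is a hypothetical helper name, and the
choice of CPU is purely illustrative. Note that this only controls where
the process runs; it does not set a memory placement policy the way
numactl --localalloc does.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids.

    Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(0, cpus)  # pid 0 means "this process"
    return os.sched_getaffinity(0)


# CPUs this process is currently allowed to run on.
allowed = sorted(os.sched_getaffinity(0))

# Illustrative choice: pin to the first allowed CPU. In practice one
# would pick the CPUs local to a node (see "numactl --hardware").
print("pinned to:", pin_to_cpus({allowed[0]}))
```

The shell equivalent for the CPU part would be something like
"taskset -c 0 ./myprog" or "numactl --cpubind=0 ./myprog", where myprog
is a placeholder.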
>
>  In that regard Chris Samuel replied with the mention of Torque and PBS
>  which would support affinity assignment. IMHO, that would be the most
>  logical place to control affinity (as long as one can provide some memory
>  access hints, ie: same options as seen in numactl's manpage)
>
>  > Thanks,
>  >
>  > Mark (Kosmowski)
>  >
>
>  Eric Thibodeau
>
>
Again, thank you for this discussion - I'm learning quite a bit!
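
For the earlier advice to verify that the memory really is split across
both nodes, here is a rough sketch that reads the per-node totals the
Linux kernel exposes under /sys/devices/system/node. It assumes a
NUMA-enabled kernel with sysfs mounted; the function names are my own,
and "numactl --hardware" reports the same information from the command
line.

```python
import glob
import re


def parse_memtotal_kb(meminfo_text):
    """Return MemTotal in kB from one node's meminfo text, or None."""
    match = re.search(r"MemTotal:\s+(\d+)\s+kB", meminfo_text)
    return int(match.group(1)) if match else None


def memory_per_node(base="/sys/devices/system/node"):
    """Map node name (e.g. 'node0') to its MemTotal in kB."""
    totals = {}
    for path in sorted(glob.glob(f"{base}/node[0-9]*/meminfo")):
        node = path.split("/")[-2]
        with open(path) as handle:
            totals[node] = parse_memtotal_kb(handle.read())
    return totals


# Prints nothing if the kernel exposes no NUMA nodes.
for node, kb in memory_per_node().items():
    if kb is None:
        print(f"{node}: MemTotal not found")
    else:
        print(f"{node}: {kb / 1024 / 1024:.2f} GiB")
```

A node reporting zero (or far less than its neighbor) would point to the
kind of lopsided DIMM population described above.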