[Beowulf] Again about NUMA (numactl and taskset)
Håkon Bugge
Hakon.Bugge at scali.com
Thu Jun 26 02:05:50 PDT 2008
At 01:23 25.06.2008, Chris Samuel wrote:
> > IMHO, the MPI should virtualize these resources
> > and relieve the end-user/application programmer
> > from the burden.
>
>IMHO the resource manager (Torque, SGE, LSF, etc) should
>be setting up cpusets for the jobs based on what the
>scheduler has told it to use and the MPI shouldn't
>get a choice in the matter. :-)
I am inclined to agree with you in a perfect
world. But, from my understanding, the resource
managers do not know the relationship between
the cores. E.g., do cores 3 and 5 share a
cache? Do they share a north-bridge bus, or are
they located on different sockets?
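For illustration, on Linux this kind of topology can be
read straight out of sysfs. Below is a minimal sketch
that answers exactly those questions for two cores
(assuming a 2.6 kernel exposing .../topology and
.../cache; cores 3 and 5, and the L2 at cache/index2,
are just example values):

/* Sketch: check whether two cores sit on the same
 * physical socket and whether they share a cache
 * level, using the kernel topology files in sysfs. */
#include <stdio.h>
#include <string.h>

static int read_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int v = -1;
    if (f) {
        if (fscanf(f, "%d", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

static void read_str(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    buf[0] = '\0';
    if (f) {
        if (fgets(buf, (int)len, f))
            buf[strcspn(buf, "\n")] = '\0';
        fclose(f);
    }
}

int main(void)
{
    int a = 3, b = 5;              /* the two cores in question */
    char pa[128], pb[128], ma[256], mb[256];

    snprintf(pa, sizeof pa,
        "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", a);
    snprintf(pb, sizeof pb,
        "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", b);
    printf("same socket: %s\n",
           read_int(pa) == read_int(pb) ? "yes" : "no");

    /* index2 is typically the L2; shared_cpu_map is a hex
     * mask of the cores sharing it, so equal masks mean the
     * two cores share that cache. */
    snprintf(pa, sizeof pa,
        "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_map", a);
    snprintf(pb, sizeof pb,
        "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_map", b);
    read_str(pa, ma, sizeof ma);
    read_str(pb, mb, sizeof mb);
    printf("share L2:    %s\n", strcmp(ma, mb) == 0 ? "yes" : "no");
    return 0;
}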
This is information we're using to optimize how
point-to-point communication is implemented. The
code-base involved is fairly complicated and I do
not expect resource management systems to cope with it.
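To give a flavour of what I mean (a toy illustration
only, not our actual code): once the library knows
whether two ranks share a node, a socket or a cache,
it can pick the cheapest channel for that particular
connection:

/* Toy sketch: choose a point-to-point channel from
 * what a topology probe (like the one above) found
 * out about the two ranks. */
enum channel { CH_INTERCONNECT, CH_SHARED_MEMORY, CH_SHARED_CACHE };

static enum channel pick_channel(int same_node, int same_socket,
                                 int share_l2)
{
    if (!same_node)
        return CH_INTERCONNECT;    /* different hosts */
    if (share_l2)
        return CH_SHARED_CACHE;    /* cheapest on-node path */
    (void)same_socket;             /* could e.g. size buffers by this */
    return CH_SHARED_MEMORY;       /* same node, different cache */
}

int main(void)
{
    /* two ranks on the same node but on different caches */
    return pick_channel(1, 1, 0) == CH_SHARED_MEMORY ? 0 : 1;
}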
I posted some measurements of the benefit of this
method some time ago, and I include them here as a
reference:
http://www.scali.com/info/SHM-perf-8bytes-2007-12-20.htm
If you look at the ping-ping numbers, you will see
a nearly constant message rate, independent of the
placement of the processes. This is contrary to
other MPIs, which (apparently) do not use this technique.
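The placement itself is forced per process, which is
what taskset or numactl does from the command line; a
minimal sketch of doing the same with sched_setaffinity
(core 3 is just an example):

/* Pin the calling process to core 3, roughly what
 * "taskset -c 3" does. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 3\n");
    return 0;
}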
So, in a practical world I go for performance, not perfect layering ;-)
>Also helps when newbies run OpenMP codes thinking they're
>single CPU codes and get 3 or 4 on the same 8 CPU node.
Not sure I follow you here. Do you mean pure OpenMP or hybrid models?
Thanks, Håkon