[Beowulf] Again about NUMA (numactl and taskset)
Håkon Bugge
Hakon.Bugge at scali.com
Thu Jun 26 02:05:50 PDT 2008
At 01:23 25.06.2008, Chris Samuel wrote:
> > IMHO, the MPI should virtualize these resources
> > and relieve the end-user/application programmer
> > from the burden.
>
>IMHO the resource manager (Torque, SGE, LSF, etc) should
>be setting up cpusets for the jobs based on what the
>scheduler has told it to use and the MPI shouldn't
>get a choice in the matter. :-)
I am inclined to agree with you in a perfect
world. But, from my understanding, the resource
managers do not know the relationship between
the cores. E.g., do cores 3 and 5 share a
cache? Do they share a north-bridge bus, or are
they located on different sockets?
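For illustration, on Linux this kind of topology can be
read straight out of sysfs. Below is a minimal sketch
that answers exactly those questions for two cores
(assuming a 2.6 kernel exposing .../topology and
.../cache; cores 3 and 5, and the L2 at cache/index2,
are just example values):

/* Sketch: check whether two cores sit on the same
 * physical socket and whether they share a cache
 * level, using the kernel topology files in sysfs. */
#include <stdio.h>
#include <string.h>

static int read_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int v = -1;
    if (f) {
        if (fscanf(f, "%d", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

static void read_str(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    buf[0] = '\0';
    if (f) {
        if (fgets(buf, (int)len, f))
            buf[strcspn(buf, "\n")] = '\0';
        fclose(f);
    }
}

int main(void)
{
    int a = 3, b = 5;              /* the two cores in question */
    char pa[128], pb[128], ma[256], mb[256];

    snprintf(pa, sizeof pa,
        "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", a);
    snprintf(pb, sizeof pb,
        "/sys/devices/system/cpu/cpu%d/topology/physical_package_id", b);
    printf("same socket: %s\n",
           read_int(pa) == read_int(pb) ? "yes" : "no");

    /* index2 is typically the L2; shared_cpu_map is a hex
     * mask of the cores sharing it, so equal masks mean the
     * two cores share that cache. */
    snprintf(pa, sizeof pa,
        "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_map", a);
    snprintf(pb, sizeof pb,
        "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_map", b);
    read_str(pa, ma, sizeof ma);
    read_str(pb, mb, sizeof mb);
    printf("share L2:    %s\n", strcmp(ma, mb) == 0 ? "yes" : "no");
    return 0;
}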
This is information we're using to optimize how
point-to-point communication is implemented. The
code-base involved is fairly complicated and I do
not expect resource management systems to cope with it.
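To give a flavour of what I mean (a toy illustration
only, not our actual code): once the library knows
whether two ranks share a node, a socket or a cache,
it can pick the cheapest channel for that particular
connection:

/* Toy sketch: choose a point-to-point channel from
 * what a topology probe (like the one above) found
 * out about the two ranks. */
enum channel { CH_INTERCONNECT, CH_SHARED_MEMORY, CH_SHARED_CACHE };

static enum channel pick_channel(int same_node, int same_socket,
                                 int share_l2)
{
    if (!same_node)
        return CH_INTERCONNECT;    /* different hosts */
    if (share_l2)
        return CH_SHARED_CACHE;    /* cheapest on-node path */
    (void)same_socket;             /* could e.g. size buffers by this */
    return CH_SHARED_MEMORY;       /* same node, different cache */
}

int main(void)
{
    /* two ranks on the same node but on different caches */
    return pick_channel(1, 1, 0) == CH_SHARED_MEMORY ? 0 : 1;
}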
I posted some measurements of the benefit of this
method some time ago, and I include them here as a
reference:
http://www.scali.com/info/SHM-perf-8bytes-2007-12-20.htm
If you look at the ping-ping numbers, you will see
a nearly constant message rate, independent of the
placement of the processes. This is contrary to
other MPIs, which (apparently) do not use this technique.
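The placement itself is forced per process, which is
what taskset or numactl does from the command line; a
minimal sketch of doing the same with sched_setaffinity
(core 3 is just an example):

/* Pin the calling process to core 3, roughly what
 * "taskset -c 3" does. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 3\n");
    return 0;
}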
So, in a practical world I go for performance, not perfect layering ;-)
>Also helps when newbies run OpenMP codes thinking they're
>single CPU codes and get 3 or 4 on the same 8 CPU node.
Not sure I follow you here. Do you mean pure OpenMP or hybrid models?
Thanks, Håkon