[Beowulf] Again about NUMA (numactl and taskset)

Håkon Bugge Hakon.Bugge at scali.com
Fri Jul 18 07:46:59 PDT 2008

At 08:39 27.06.2008, Patrick Geoffray wrote:
>Hi Hakon,
>Håkon Bugge wrote:
>>This is information we're using to optimize how 
>>pnt-to-pnt communication is implemented. The 
>>code-base involved is fairly complicated and I 
>>do not expect resource management systems to cope with it.
>Why not ? It's its job to know the resources it 
>has to manage. The resource manager has more 
>information than you, it does not have to detect 
>at runtime for each job, and it can manage cores 
>allocation across jobs. You cannot expect the 
>granularity of the allocation to stay at the 
>node level with the core count increasing.

This raises two questions: a) Which job 
schedulers are able to optimize placement on 
cores and thereby _improve_ application 
performance? b) Which job schedulers are able to 
deduce which cores share an L3 cache and sit on the same socket?
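For concreteness, the socket and shared-cache layout can in principle be read straight from the Linux sysfs topology interface. A minimal sketch (not from the original post; exact paths and cache index numbering vary by kernel and CPU, and index3 is only _typically_ the L3):

```shell
# Sketch: deduce socket and L3-sharing layout from sysfs.
# Assumes the standard Linux topology files; fields missing on a
# given system are reported as "?" / "n/a" rather than failing.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  pkg=$(cat "$cpu/topology/physical_package_id" 2>/dev/null)
  # shared_cpu_list names the CPUs that share this cache level
  l3=$(cat "$cpu/cache/index3/shared_cpu_list" 2>/dev/null)
  echo "${cpu##*/}: socket=${pkg:-?} shares-L3-with=${l3:-n/a}"
done
```

A scheduler (or an MPI library probing at startup) could use exactly this information to co-locate communicating ranks on cores that share a cache.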

... and a clarification: systems using Scali MPI 
Connect _can_ operate at finer granularity than 
the node level; the job scheduler simply must not 
oversubscribe the node. Scali MPI Connect assigns 
cores to processes _dynamically_.
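Whether the binding is imposed by the scheduler or by the MPI library, the underlying mechanism on Linux is the same pair of tools named in the subject line. A minimal sketch (the binary and the node/core numbers are placeholders, not from the original post):

```shell
# Sketch: enforcing affinity from the launcher side.
# Restrict a command to cores 0-3 only:
taskset -c 0-3 echo "pinned to cores 0-3"
# Bind both CPU scheduling and memory allocation to NUMA node 0,
# if numactl is installed on this system:
command -v numactl >/dev/null \
  && numactl --cpunodebind=0 --membind=0 echo "bound to node 0" \
  || true
```

Oversubscription is exactly the failure mode here: if the scheduler hands two jobs overlapping core masks, any finer-grained placement done inside the MPI library is defeated.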

>If the MPI implementation does the spawning, it 
>should definitely have support to enforce core 
>affinity (most do AFAIK). However, core affinity 
>should be dictated by the scheduler. Heck, the 
>MPI implementation should not do the spawning in the first place.
>Historically, resource managers have been pretty 
>dumb. These days, there is enough competition in this domain to expect better.

I am fine with schedulers dictating core affinity, but not if performance suffers as a result.
