[Beowulf] numactl and load balancing
Lawrence Stewart
stewart at serissa.com
Thu Jul 23 13:37:38 PDT 2015
I wouldn’t expect this technique to work the way you want, even if it
did start the jobs in the right places.
The L3 cache will have <some> associativity, but I doubt it is 15-way, so there
may be random collisions in the cache between the datasets of the different
copies of your program. There is enough aggregate space, but no assurance the
system will figure out how to tile it correctly.
It would probably work more reliably to write a single program with a 30 MB array,
then start 15 threads, giving each thread a contiguous 2 MB chunk of that array.
That way, you are assured of distinct virtual addresses, and if you can, for
example, mmap a 30 MB file using hugetlbfs, you may be able to get physical
contiguity as well. Any reasonable cache design will run that with no collisions.
You can use the pthreads library functions to set the CPU affinity of each thread.
You can run two copies of this program, one per socket, using cpuset or numactl.
> On 2015, Jul 23, at 3:03 PM, mathog <mathog at caltech.edu> wrote:
>
> Dell with 2CPU x 12core x 2 threads, shows up in procinfo as 48 cpus.
>
> Trying to run 30 processes, one each on a different "CPU", by starting them one at a time with
>
> numactl -C 1-30 /$PATH/program #args...
>
> when 30 have started the script spins waiting for one to exit, then another is started. "top" is showing some of these are running at 50% CPU, so they are being started on a CPU which already has a job going. I can see where that would happen, since there doesn't seem to be anything in numactl about load balancing.
>
> The thing is, these processes are _staying_ on the same CPU, never migrating to another. That I don't understand. I would have thought numactl sets some mask on the process restricting the CPUs it can move to, but would not otherwise affect it, so the OS should migrate it when it sees this situation. In practice it seems to leave it running on whichever CPU it starts on. Or does Linux not migrate processes when they are heavily loading a single CPU, only when they run out of memory?
>
> Also "perf top" shows 81% for the program and 13% for numactl.
>
> The goal here is to carefully divvy up the load so that exactly 15 jobs run on each NUMA zone, since then the data in all the inner loops will fit within the 30M of L3 cache on each CPU. If it puts 17 on one and 13 on the other, the inner loop data won't fit and performance slows down dramatically. Looks like I need to keep track of which job is running where and numactl lock it to that node. (I don't think there is a queue system on this machine at present.)
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf