[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Wed Jul 8 16:45:44 PDT 2015
On 08-Jul-2015 15:43, Jonathan Barber wrote:
> I think your process is being moved between NUMA nodes and you're
> losing locality to the data. Try confining the process and data to
> the same NUMA node with the numactl command.
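For reference, the confinement Jonathan suggests would look something
like "numactl --cpunodebind=0 --membind=0 testprogram -in KTEMP0"
(node 0 is just a placeholder). A minimal sketch of the same
confinement done programmatically with the libnuma C API, assuming the
program allocates its own read buffer:

/* Sketch: pin the current thread to NUMA node 0 and allocate the
 * read buffer from node 0's memory.  Node 0 and the 1 GiB size are
 * placeholders.  Compile with: gcc confine.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    /* Run this thread only on the CPUs of node 0. */
    if (numa_run_on_node(0) != 0) {
        perror("numa_run_on_node");
        return 1;
    }
    /* Allocate the buffer from node 0's memory. */
    size_t len = 1UL << 30;
    char *buf = numa_alloc_onnode(len, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    /* ... fread() into buf here ... */
    numa_free(buf, len);
    return 0;
}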
That diagnosis is part of it. I ran a bunch of commands like this:
taskset -c 20 dd if=KTEMP1 of=KTEMP0 bs=120000 count=34000
taskset -c 20 testprogram -in KTEMP0
with these results:
 count   size (GB)   time (s)   size (bytes)
 34000       ~4         ~3       4080000000
 68000       ~8         ~7       8160000000
 70000       ~8         ~3       8400000000
100000      ~12         ~3      12000000000
120000      ~14         ~7      14400000000
130000      ~16         ~9      15600000000
140000      ~17       >120      16800000000  (2^34 is 17179869184)
(I didn't wait for the 140000 case to complete; it could have gone on
for another 5 minutes.) The variation between ~3 s and ~9 s isn't
significant or repeatable; I think it represents the flush process
getting in the way of the second command.
If the test was changed so that "-c 1" was used for the first command
and "-c 20" for the second, then the 130000-record case took 23 s (the
modified command pair is spelled out below). So there is definitely an
advantage in having the file-cache pages somehow associated with the
CPU where they will be needed next.
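For concreteness, that modified pair for the 130000-record case was:
taskset -c 1 dd if=KTEMP1 of=KTEMP0 bs=120000 count=130000
taskset -c 20 testprogram -in KTEMP0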
Now the mystery is what the problem is for an fread() into a buffer
that is close to, but just below, 2^34 bytes.
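As a point of reference, this is roughly what I assume the test
program's read path looks like (the single big fread() and the exact
buffer size are my assumptions; the sizes match the slow 140000-record
case above, and a 64-bit size_t is required):

/* Sketch of the assumed read path: one fread() of a ~16.8 GB file
 * into a single malloc'd buffer, just under 2^34 bytes. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t len = 140000UL * 120000UL;  /* 16800000000 bytes */
    char *buf = malloc(len);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    FILE *fp = fopen("KTEMP0", "rb");
    if (fp == NULL) {
        perror("fopen");
        free(buf);
        return 1;
    }
    size_t got = fread(buf, 1, len, fp);     /* the slow step */
    printf("read %zu of %zu bytes\n", got, len);
    fclose(fp);
    free(buf);
    return 0;
}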
Here is the output of "ulimit -a":
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 4134441
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Nothing there screams 2^34 to me. Perhaps something crucial for
performance needs to be locked into memory and grows beyond 64 kB at
that buffer size, and that indirectly leads to the performance
problem.
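For what it's worth, an ordinary mlock() call past RLIMIT_MEMLOCK
fails outright with ENOMEM rather than degrading, so if the 64 kB
limit matters here it would have to be indirect. A small sketch of
that behavior (the 128 kB size is arbitrary, and whether mlock() is
involved at all is pure speculation):

/* Sketch: attempt to lock more than the 64 kB RLIMIT_MEMLOCK allows;
 * an unprivileged process should see mlock() fail with ENOMEM. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 128 * 1024;   /* 128 kB, over the 64 kB limit */
    char *buf = malloc(len);
    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    if (mlock(buf, len) != 0)
        fprintf(stderr, "mlock failed: %s\n", strerror(errno));
    else {
        printf("locked %zu bytes\n", len);
        munlock(buf, len);
    }
    free(buf);
    return 0;
}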
As an aside, when the test program is locked to a CPU and a file which
is "too big" is read, there is no migration/20 process using CPU time.
Instead, an events/20 thread starts using a significant amount of CPU
time (varying wildly around 30%). ksoftirqd/20 also comes and goes, so
that could be a factor as well.
>
> Assuming your machine is NUMA (hwloc-ls will show you this), in my
> experience some of the E5s have really poor performance for
> inter-NUMA communication.
I don't have anything called hwloc-ls on this system. What package
provides it? This is a CentOS system.
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech