[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Fri Jul 10 12:49:27 PDT 2015
FYI
I was curious how well the NUMA memory system performs when reading data
from the file cache under various loads, so I ran a script which used dd
to test that.  dd itself uses very little memory, most likely fitting
within the CPU caches, so the tests measure only file cache reads.
First the script made a 2^36 byte (68.7 GB) test file on a specific CPU.
Then it timed 3 cycles of tests like:
taskset -c 10 dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
# one or more instances similar to the preceding line,
# with a different CPU number on each
wait
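For reference, here is a minimal sketch of the sort of script involved
(the file name BIGTEST, the CPU numbers, and the timing method are
illustrative assumptions, not the exact script):

#!/bin/bash
# create the 2^36 byte test file on CPU 9 so its page cache pages
# are allocated on that CPU's NUMA node
taskset -c 9 dd if=/dev/zero of=BIGTEST bs=8192 count=8388608 2>/dev/null

# time 3 cycles of simultaneous reads on the chosen CPUs
for cycle in 1 2 3 ; do
    start=$(date +%s.%N)
    for cpu in 11 13 15 17 19 ; do    # e.g. 5 CPUs on CPU 9's node
        taskset -c $cpu dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
    done
    wait
    end=$(date +%s.%N)
    echo "cycle $cycle: $(echo "$end - $start" | bc) s"
done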
"same" refers to the CPU/NUMA node where the file was created by dd.
It was created on cpu 9, and tests were run on CPUs 10 and up, odd ones
were all on one NUMA node, even on the other.
The results, in seconds for each of the 3 cycles, were:
14.002 11.754 11.446, 1 same CPU, same NUMA node
12.061 11.359 11.414, 1 !same CPU, same NUMA node
17.784 17.221 17.330, 1 !same CPU, !same NUMA node
16.900 16.717 16.586, 5 !same CPUs, same NUMA node
25.455 23.467 24.387, 15 !same CPUs, same NUMA node
19.810 19.640 19.453, 5 !same CPUs, !same NUMA node
29.388 28.573 29.295, 15 !same CPUs, !same NUMA node
18.182 18.525 18.252, 10 !same CPUs, 5:5 same:!same NUMA node
28.357 29.328 28.850, 30 !same CPUs, 15:15 same:!same NUMA node
numactl --hardware | tail -4
node distances:
node   0   1
  0:  10  20
  1:  20  10
From the node distances shown I was expecting !same NUMA node reads to
take twice as long as same NUMA node reads, but the penalty is actually
only around 42%.  It is unclear where the "10" and "20" in the numactl
node distances come from.  It doesn't look like the file cache pages
migrated from one node to the other during these tests, as the !same
node reads stayed consistently slower.  Reading with more and more CPUs
on a node did slow down the read rate for each CPU, but not as badly as
I had feared it might, and nowhere near as badly as 1/N: the per-CPU
read rate only fell from a fastest of 6 GB/s to a slowest of 2.3 GB/s
(much better than the potentially sluggish 6/15 = 0.4 GB/s).  It may be
that adding reads on the !same NUMA node slows down the same NUMA node
CPUs (5th and 9th rows); one cannot tell from these numbers because the
time measured is just until the last process exits.  In any case, the
converse wasn't true (7th and 9th rows).
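For reference, the rates quoted above are just the 68.7 GB file size
divided by the cycle times in the table, e.g.:

# single same-node reader:       68.7 GB / 11.4 s ~ 6.0 GB/s per CPU
# 15 readers on the !same node:  68.7 GB / 29.3 s ~ 2.3 GB/s per CPU,
#                                an aggregate of roughly 35 GB/s
echo "scale=1; 68.7/11.4; 68.7/29.3; 15*68.7/29.3" | bc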
Finally, most of these tests were very nearly synchronous for all of the
dd operations, so a few were rerun with offsets of 1 second between
blocks of dd commands.  If breaking synchrony made no difference, then
these would run only as much slower as the total introduced delay (2 or
4 seconds); if breaking synchrony mattered, they would run much slower
than that.  (In the table below "between N's" means that the dd commands
were split into groups of N and a "sleep 1" was placed between the
groups; a sketch of such a staggered launch follows the table.)
26.019 26.996 26.932, 15 !same CPUs, same NUMA node, 2s total delay, 1s between 5's
28.048 29.692 27.233, 15 !same CPUs, same NUMA node, 4s total delay, 1s between 3's
29.278 27.971 30.183, 15 !same CPUs, !same NUMA node, 2s total delay, 1s between 5's
29.066 30.683 29.958, 15 !same CPUs, !same NUMA node, 4s total delay, 1s between 3's
29.172 28.374 29.067, 30 !same CPUs, 15:15 same:!same NUMA node, 2s total delay, 1s between 10's
30.167 30.723 29.809, 30 !same CPUs, 15:15 same:!same NUMA node, 4s total delay, 1s between 6's
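A minimal sketch of how such a staggered launch can be done (CPU numbers
and file name are again illustrative, not the exact script); this
example corresponds to the "15 !same CPUs, same NUMA node, 1s between
5's" case:

CPUS="11 13 15 17 19 21 23 25 27 29 31 33 35 37 39"  # 15 CPUs on CPU 9's node
n=0
for cpu in $CPUS ; do
    taskset -c $cpu dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
    n=$((n + 1))
    # "1s between 5's": sleep after every 5 launches except the last group,
    # giving 2 seconds of total introduced delay
    if [ $((n % 5)) -eq 0 ] && [ $n -lt 15 ] ; then
        sleep 1
    fi
done
wait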
The results surprised me.  The expected delays showed up when they were
introduced on the same NUMA node (first two tests), but were invisible
when !same NUMA node CPUs were used, either exclusively or in
combination (the remainder of the tests).  It seems that the introduced
delays are somehow absorbed into the existing delays in moving data from
one NUMA node to the other, and one ends up with the odd situation of
30 + 4 = 30!  (Yes, I checked the script thoroughly; the "sleep 1"
commands really did execute in those tests.)
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech