[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Fri Jul 10 12:49:27 PDT 2015
FYI
I was curious how well the NUMA memory system performs when reading data
from the file cache under various loads, so I ran a script which used dd
to test that.  dd itself uses very little memory, most likely fitting
within the CPU caches, so the tests measure only file cache reads.
First the script made a 2^36 byte (68.7 GB) test file on a specific CPU.
Then it timed 3 cycles of tests like:
taskset -c 10 dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
# one or more instances similar to the preceding line,
# with a different CPU number on each
wait
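For reference, here is a minimal sketch of the sort of script involved
(the file name BIGTEST, the CPU numbers, and the timing method are
illustrative assumptions, not the exact script):

#!/bin/bash
# create the 2^36 byte test file on CPU 9 so its page cache pages
# are allocated on that CPU's NUMA node
taskset -c 9 dd if=/dev/zero of=BIGTEST bs=8192 count=8388608 2>/dev/null

# time 3 cycles of simultaneous reads on the chosen CPUs
for cycle in 1 2 3 ; do
    start=$(date +%s.%N)
    for cpu in 11 13 15 17 19 ; do    # e.g. 5 CPUs on CPU 9's node
        taskset -c $cpu dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
    done
    wait
    end=$(date +%s.%N)
    echo "cycle $cycle: $(echo "$end - $start" | bc) s"
done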
"same" refers to the CPU/NUMA node where the file was created by dd.
It was created on cpu 9, and tests were run on CPUs 10 and up, odd ones
were all on one NUMA node, even on the other.
The results, in seconds for each of the 3 cycles, were:
14.002 11.754 11.446, 1 same CPU, same NUMA node
12.061 11.359 11.414, 1 !same CPU, same NUMA node
17.784 17.221 17.330, 1 !same CPU, !same NUMA node
16.900 16.717 16.586, 5 !same CPUs, same NUMA node
25.455 23.467 24.387, 15 !same CPUs, same NUMA node
19.810 19.640 19.453, 5 !same CPUs, !same NUMA node
29.388 28.573 29.295, 15 !same CPUs, !same NUMA node
18.182 18.525 18.252, 10 !same CPUs, 5:5 same:!same NUMA node
28.357 29.328 28.850, 30 !same CPUs, 15:15 same:!same NUMA node
numactl --hardware | tail -4
node distances:
node   0   1
  0:  10  20
  1:  20  10
From the node distances shown I was expecting !same NUMA node reads to
take twice as long as same NUMA node reads, but the penalty is actually
only around 42%.  It is unclear where the "10" and "20" in the numactl
node distances come from.  It doesn't look like the file cache pages
migrated from one node to the other during these tests, as the !same
node reads stayed consistently slower.  Reading with more and more CPUs
on a node did slow down the read rate for each CPU, but not as badly as
I had feared it might, and nowhere near as badly as 1/N: the per-CPU
read rate only fell from a fastest of 6 GB/s to a slowest of 2.3 GB/s
(much better than the potentially sluggish 6/15 = 0.4 GB/s).  It may be
that adding reads on the !same NUMA node slows down the same NUMA node
CPUs (5th and 9th rows); one cannot tell from these numbers because the
time measured is just until the last process exits.  In any case, the
converse wasn't true (7th and 9th rows).
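For reference, the rates quoted above are just the 68.7 GB file size
divided by the cycle times in the table, e.g.:

# single same-node reader:       68.7 GB / 11.4 s ~ 6.0 GB/s per CPU
# 15 readers on the !same node:  68.7 GB / 29.3 s ~ 2.3 GB/s per CPU,
#                                an aggregate of roughly 35 GB/s
echo "scale=1; 68.7/11.4; 68.7/29.3; 15*68.7/29.3" | bc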
Finally, most of these tests were very nearly synchronous for all of the
dd operations, so a few were rerun with offsets of 1 second between
blocks of dd commands.  If breaking synchrony made no difference, then
these would run only as much slower as the total introduced delay (2 or
4 seconds); if breaking synchrony mattered, they would run much slower
than that.  (In the table below "between N's" means that the dd commands
were split into groups of N and a "sleep 1" was placed between the
groups; a sketch of such a staggered launch follows the table.)
26.019 26.996 26.932, 15 !same CPUs, same NUMA node, 2s total delay, 1s between 5's
28.048 29.692 27.233, 15 !same CPUs, same NUMA node, 4s total delay, 1s between 3's
29.278 27.971 30.183, 15 !same CPUs, !same NUMA node, 2s total delay, 1s between 5's
29.066 30.683 29.958, 15 !same CPUs, !same NUMA node, 4s total delay, 1s between 3's
29.172 28.374 29.067, 30 !same CPUs, 15:15 same:!same NUMA node, 2s total delay, 1s between 10's
30.167 30.723 29.809, 30 !same CPUs, 15:15 same:!same NUMA node, 4s total delay, 1s between 6's
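A minimal sketch of how such a staggered launch can be done (CPU numbers
and file name are again illustrative, not the exact script); this
example corresponds to the "15 !same CPUs, same NUMA node, 1s between
5's" case:

CPUS="11 13 15 17 19 21 23 25 27 29 31 33 35 37 39"  # 15 CPUs on CPU 9's node
n=0
for cpu in $CPUS ; do
    taskset -c $cpu dd if=BIGTEST of=/dev/null bs=8192 </dev/null 2>/dev/null &
    n=$((n + 1))
    # "1s between 5's": sleep after every 5 launches except the last group,
    # giving 2 seconds of total introduced delay
    if [ $((n % 5)) -eq 0 ] && [ $n -lt 15 ] ; then
        sleep 1
    fi
done
wait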
The results surprised me.  The expected delays showed up when they were
introduced on the same NUMA node (first two tests), but were invisible
when !same NUMA node CPUs were used, either exclusively or in
combination (the remainder of the tests).  It seems that the introduced
delays are somehow absorbed into the existing delays in moving data from
one NUMA node to the other, and one ends up with the odd situation of
30 + 4 = 30!  (Yes, I checked the script thoroughly; the "sleep 1"
commands really did execute in those tests.)
Regards,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech