[Beowulf] big read triggers migration and slow memory IO?
mathog
mathog at caltech.edu
Thu Jul 9 11:44:09 PDT 2015
Reran the generators, and that did make the system slow again, so at
least the problem can be reproduced.
After they ran, free memory is definitely in short supply: pretty much
everything is sitting in the file cache. For whatever reason, the system
seems loath to release memory from the file cache for other uses. I
think that is the problem.
Here is some data; this gets a bit long...
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
42 44 46
node 0 size: 262098 MB
node 0 free: 18372 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
43 45 47
node 1 size: 262144 MB
node 1 free: 2829 MB
node distances:
node 0 1
0: 10 20
1: 20 10
CPU specific tests were done on CPU 20, which is on NUMA node 0. None of
the tests comes close to using all of the physical memory in a "node",
which is 262GB.
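One cross-check I have not tried: pin a test so that both the CPU and its
memory allocations are forced onto node 0, which should rule cross-node
migration in or out ("./testprog" is just a stand-in for the real test
binary):
numactl --cpunodebind=0 --membind=0 ./testprog
If a run bound that way is still slow after the generators, then reclaim
rather than migration is the problem.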
When cache has been cleared, and the test programs run fast:
cat /proc/meminfo | head -11
MemTotal: 529231456 kB
MemFree: 525988868 kB
Buffers: 5428 kB
Cached: 46544 kB
SwapCached: 556 kB
Active: 62220 kB
Inactive: 121316 kB
Active(anon): 26596 kB
Inactive(anon): 109456 kB
Active(file): 35624 kB
Inactive(file): 11860 kB
Run one test and it jumps up to:
MemTotal: 529231456 kB
MemFree: 491812500 kB
Buffers: 10644 kB
Cached: 34139976 kB
SwapCached: 556 kB
Active: 34152592 kB
Inactive: 130400 kB
Active(anon): 27560 kB
Inactive(anon): 109316 kB
Active(file): 34125032 kB
Inactive(file): 21084 kB
and the next test is still quick. After the generators have run, but
while nothing much else is running, it starts like this:
cat /proc/meminfo | head -11
MemTotal: 529231456 kB
MemFree: 19606616 kB
Buffers: 46704 kB
Cached: 493107268 kB
SwapCached: 556 kB
Active: 34229020 kB
Inactive: 459056372 kB
Active(anon): 712 kB
Inactive(anon): 135508 kB
Active(file): 34228308 kB
Inactive(file): 458920864 kB
Then, when a test job is run, it drops quickly to this and sticks. Note
the MemFree value. I think this is where the "Events/20" process kicks
in:
cat /proc/meminfo | head -11
MemTotal: 529231456 kB
MemFree: 691740 kB
Buffers: 46768 kB
Cached: 493056968 kB
SwapCached: 556 kB
Active: 53164328 kB
Inactive: 459006232 kB
Active(anon): 18936048 kB
Inactive(anon): 135608 kB
Active(file): 34228280 kB
Inactive(file): 458870624 kB
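To check whether it really is "Events/20" (and not, say, kswapd) doing
the work at that point, something like this in another window should put
the busy kernel threads at the top; I did not capture it during these
runs:
# list threads sorted by CPU usage, busiest first
ps -eLo pcpu,pid,comm --sort=-pcpu | head -10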
Kill the process and the system "recovers" to the preceding memory state
in a few seconds. Similarly, here are the /proc/zoneinfo values from
before the generators were run, when the system was fast:
extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10 -ifonly
Node 0, zone DMA
pages free 3931
min 0
low 0
high 0
scanned 0
spanned 4095
present 3832
nr_free_pages 3931
nr_inactive_anon 0
nr_active_anon 0
Node 0, zone DMA32
pages free 105973
min 139
low 173
high 208
scanned 0
spanned 1044480
present 822056
nr_free_pages 105973
nr_inactive_anon 0
nr_active_anon 0
Node 0, zone Normal
pages free 50199731
min 11122
low 13902
high 16683
scanned 0
spanned 66256896
present 65351040
nr_free_pages 50199731
nr_inactive_anon 16490
nr_active_anon 7191
Node 1, zone Normal
pages free 57596396
min 11265
low 14081
high 16897
scanned 0
spanned 67108864
present 66191360
nr_free_pages 57596396
nr_inactive_anon 10839
nr_active_anon 1772
and after the generators were run (slow):
Node 0, zone DMA
pages free 3931
min 0
low 0
high 0
scanned 0
spanned 4095
present 3832
nr_free_pages 3931
nr_inactive_anon 0
nr_active_anon 0
Node 0, zone DMA32
pages free 105973
min 139
low 173
high 208
scanned 0
spanned 1044480
present 822056
nr_free_pages 105973
nr_inactive_anon 0
nr_active_anon 0
Node 0, zone Normal
pages free 23045
min 11122
low 13902
high 16683
scanned 0
spanned 66256896
present 65351040
nr_free_pages 23045
nr_inactive_anon 16486
nr_active_anon 5839
Node 1, zone Normal
pages free 33726
min 11265
low 14081
high 16897
scanned 0
spanned 67108864
present 66191360
nr_free_pages 33726
nr_inactive_anon 10836
nr_active_anon 1065
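For anyone without the local "extract" tool, an awk equivalent that pulls
the same slice (each Node header plus the 10 lines after it) straight
from the live file:
awk '/^Node/ {n=11} n>0 {print; n--}' /proc/zoneinfo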
Looking the same way at /proc/zoneinfo while a test is running showed
the "pages free" and "nr_free_pages" values oscillating downward to
a low of about 28000 for Node 0, zone Normal. The rest of the values
were essentially stable.
Looking the same way at /proc/meminfo while a test is running gave
values that differed in only minor ways from the "after" table shown
above. MemFree varied in a range from about 680000 to 720000 kB.
Cached dropped to ~482407184 kB and then barely budged at all.
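For the record, a throwaway loop like this one (the log file name is made
up) is enough to capture the same thing with timestamps:
while sleep 1; do
    # pull the two fields of interest and tag them with the time
    awk '/^MemFree|^Cached/ {printf "%s %s ", $1, $2}' /proc/meminfo
    date +%T
done >> meminfo_during_test.log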
Finally, the last few lines from "sar -B", rewrapped into one table:
            pgpgin/s pgpgout/s  fault/s majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s  %vmeff
10:30:03 AM  5810.55 301475.26    95.99     0.05  51710.29  48086.79      0.00  48084.94  100.00
10:40:01 AM  3404.90 185502.87    96.67     0.01  47267.84  44816.30      0.00  44816.30  100.00
10:50:02 AM     9.13     13.32   192.24     0.11   4592.56     48.54   3149.01   3197.55  100.00
11:00:01 AM   191.78      9.97   347.56     0.13  16760.51      0.00   3683.21   3683.21  100.00
11:10:01 AM    11.64      7.75   342.59     0.09  18528.24      0.00   1699.66   1699.66  100.00
11:20:01 AM     0.00      6.75    96.87     0.00     43.97      0.00      0.00      0.00    0.00
The generators finished at 10:35. In the 10:30 sample (while they were
still running) pgscank/s and pgsteal/s jumped from 0 to high values.
When tests were run afterwards, pgscank/s fell back to almost nothing
but pgsteal/s stayed high, and those runs also pushed pgscand/s from 0
to several thousand per second. The last row covers a 10 minute span in
which no tests were run, and the scan and steal values all dropped back
to zero.
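The 10-minute sa bins hide a lot of detail; if I rerun this, per-second
samples taken during a single slow read should show whether pgscand/s
tracks the read exactly. Something like this (the output file name is
just a placeholder):
# 300 one-second samples of paging activity
sar -B 1 300 > sar_B_during_test.txt
vmstat 1 would give a rough cross-check of the same reclaim activity.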
Since excessive file cache seems to be implicated, I did this:
echo 3 > /proc/sys/vm/drop_caches
and reran the test on CPU 20. It was fast.
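For the record, the usual form runs sync first, since drop_caches only
discards clean pages and anything dirty stays put until it has been
written back:
sync
echo 3 > /proc/sys/vm/drop_caches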
I guess the question now is which parameter(s) control reclaiming memory
from the file cache for other purposes when free memory is in short
supply and there is substantial demand. It seems the OS isn't releasing
the cache. Or maybe it isn't flushing it to disk; I don't think it's the
latter, though, because iotop and iostat show no activity during a
"slow" read.
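My guess at where to look next is the vm sysctls that usually get blamed
for this sort of reclaim behaviour, so for comparison with other systems:
sysctl vm.zone_reclaim_mode vm.min_free_kbytes vm.vfs_cache_pressure vm.swappiness
Whether any of those is actually responsible here I don't know yet.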
Thanks,
David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech