[Beowulf] big read triggers migration and slow memory IO?
Steffen Persvold
sp at numascale.com
Thu Jul 9 12:28:44 PDT 2015
Hi David,
Have you tried setting /proc/sys/vm/zone_reclaim_mode to 3 or 7?
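For reference, zone_reclaim_mode is a bitmask (1 = zone reclaim on, 2 = also write out dirty pages during reclaim, 4 = also swap/unmap pages), so 3 or 7 make the kernel reclaim local page cache more aggressively before falling back to allocations on the remote node. As root, something like:

echo 3 > /proc/sys/vm/zone_reclaim_mode   # zone reclaim + writeback of dirty pages
sysctl -w vm.zone_reclaim_mode=3          # same thing via sysctl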
Cheers,
--
Steffen Persvold
Chief Architect NumaChip, Numascale AS
Tel: +47 23 16 71 88 Fax: +47 23 16 71 80 Skype: spersvold
> On 09 Jul 2015, at 20:44, mathog <mathog at caltech.edu> wrote:
>
> Reran the generators and that did make the system slow again, so at least this problem can be reproduced.
>
> After those ran, memory is definitely in short supply; pretty much everything is in file cache. For whatever reason, the system seems loath to release memory from file cache for other uses. I think that is the problem.
>
> Here is some data, this is a bit long...
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46
> node 0 size: 262098 MB
> node 0 free: 18372 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47
> node 1 size: 262144 MB
> node 1 free: 2829 MB
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
>
> CPU-specific tests were done on CPU 20, which is in NUMA node 0. None of the tests comes close to using up all the physical memory in a node, which is 262 GB.
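>
> For repeatability, the runs can be pinned explicitly; a sketch, with ./test standing in for the actual test program:
>
> numactl --physcpubind=20 --membind=0 ./test
>
> which keeps both the CPU and the memory allocations on node 0.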
>
> When cache has been cleared, and the test programs run fast:
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 525988868 kB
> Buffers: 5428 kB
> Cached: 46544 kB
> SwapCached: 556 kB
> Active: 62220 kB
> Inactive: 121316 kB
> Active(anon): 26596 kB
> Inactive(anon): 109456 kB
> Active(file): 35624 kB
> Inactive(file): 11860 kB
>
> run one test and it jumps up to
>
> MemTotal: 529231456 kB
> MemFree: 491812500 kB
> Buffers: 10644 kB
> Cached: 34139976 kB
> SwapCached: 556 kB
> Active: 34152592 kB
> Inactive: 130400 kB
> Active(anon): 27560 kB
> Inactive(anon): 109316 kB
> Active(file): 34125032 kB
> Inactive(file): 21084 kB
>
> and the next test is still quick. After running the generators, with nothing much else running, it starts out like this:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 19606616 kB
> Buffers: 46704 kB
> Cached: 493107268 kB
> SwapCached: 556 kB
> Active: 34229020 kB
> Inactive: 459056372 kB
> Active(anon): 712 kB
> Inactive(anon): 135508 kB
> Active(file): 34228308 kB
> Inactive(file): 458920864 kB
>
> Then, when a test job is run, it drops quickly to this and sticks. Note the MemFree value. I think this is where the "events/20" kernel thread kicks in:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 691740 kB
> Buffers: 46768 kB
> Cached: 493056968 kB
> SwapCached: 556 kB
> Active: 53164328 kB
> Inactive: 459006232 kB
> Active(anon): 18936048 kB
> Inactive(anon): 135608 kB
> Active(file): 34228280 kB
> Inactive(file): 458870624 kB
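>
> A crude way to watch this while a test runs is a one-line sampling loop (just a sketch):
>
> while sleep 1; do grep -E '^(MemFree|Cached):' /proc/meminfo; done
>
> which shows MemFree pinned near its floor for the duration of the run.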
>
> Kill the process and the system "recovers" to the preceding memory configuration in a few seconds. Similarly, here are the /proc/zoneinfo values from before the generators were run, when the system was fast:
>
> extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10 -ifonly
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 50199731
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 50199731
> nr_inactive_anon 16490
> nr_active_anon 7191
> Node 1, zone Normal
> pages free 57596396
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 57596396
> nr_inactive_anon 10839
> nr_active_anon 1772
>
> and after the generators were run (slow):
>
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 23045
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 23045
> nr_inactive_anon 16486
> nr_active_anon 5839
> Node 1, zone Normal
> pages free 33726
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 33726
> nr_inactive_anon 10836
> nr_active_anon 1065
>
> Looking the same way at /proc/zoneinfo while a test is running showed the "pages free" and "nr_free_pages" values oscillating downward to a low of about 28000 for Node 0, zone Normal. The rest of the values were essentially stable.
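>
> The per-zone numbers can be watched the same way, e.g. (a sketch):
>
> watch -n1 "grep -E '^Node|nr_free_pages' /proc/zoneinfo"
>
> which makes the downward oscillation in nr_free_pages easy to see.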
>
> Looking the same way at /proc/meminfo while a test is running gave values that differed only in minor ways from the "after" table shown above. MemFree varied in a range from about 680000 to 720000 kB. Cached dropped to ~482407184 kB and then barely budged.
>
> Finally, the last few lines from "sar -B":
>
>               pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
> 10:30:03 AM    5810.55 301475.26     95.99      0.05  51710.29  48086.79      0.00  48084.94    100.00
> 10:40:01 AM    3404.90 185502.87     96.67      0.01  47267.84  44816.30      0.00  44816.30    100.00
> 10:50:02 AM       9.13     13.32    192.24      0.11   4592.56     48.54   3149.01   3197.55    100.00
> 11:00:01 AM     191.78      9.97    347.56      0.13  16760.51      0.00   3683.21   3683.21    100.00
> 11:10:01 AM      11.64      7.75    342.59      0.09  18528.24      0.00   1699.66   1699.66    100.00
> 11:20:01 AM       0.00      6.75     96.87      0.00     43.97      0.00      0.00      0.00      0.00
>
> The generators finished at 10:35. At the 10:30 data point (while they were running), pgscank/s and pgsteal/s jumped from 0 to high values. When tests were run later, pgscank/s fell back to almost nothing but pgsteal/s stayed high. Additionally, the test runs made after the generators pushed pgscand/s from 0 to several thousand per second. The last row covers a 10-minute span in which no tests were run, and all of these values dropped back to zero.
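>
> For anyone reading along: pgscank/s is pages scanned by the kswapd daemon, pgscand/s is pages scanned "directly" by the allocating process itself, and pgsteal/s is pages actually reclaimed. The several thousand pgscand/s during the slow runs therefore means the test process is stalling inside the allocator, doing direct reclaim. The same columns can be watched live rather than from the history file, e.g.:
>
> sar -B 10
>
> which prints a fresh line every 10 seconds.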
>
> Since excessive file cache seems to be implicated, I did this:
> echo 3 > /proc/sys/vm/drop_caches
>
> and reran the test on CPU 20. It was fast.
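>
> For the record, the usual incantation flushes dirty pages first, since drop_caches only discards clean pages:
>
> sync
> echo 3 > /proc/sys/vm/drop_caches   # 1 = pagecache, 2 = dentries+inodes, 3 = both
>
> though with essentially no write activity here that presumably makes no difference.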
>
> I guess the question now is which parameter(s) control how readily memory is reclaimed from the file cache for other uses when free memory is in short supply and there is substantial demand. It seems the OS isn't releasing the cache, or maybe it isn't flushing dirty pages to disk first. I don't think it's the latter, because iotop and iostat show no disk activity during a "slow" read.
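>
> The obvious knobs to inspect seem to be these (standard sysctl names; the value below is only an illustration, not a recommendation):
>
> sysctl vm.zone_reclaim_mode vm.min_free_kbytes vm.swappiness vm.vfs_cache_pressure
> sysctl -w vm.min_free_kbytes=1048576   # e.g. a larger free pool makes kswapd start reclaiming sooner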
>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf