[Beowulf] big read triggers migration and slow memory IO?
James Cuff
james_cuff at harvard.edu
Thu Jul 9 11:54:26 PDT 2015
Wow - yeah, David, this sure is a doozy!
Super long shot...
http://blog.jcuff.net/2015/04/of-huge-pages-and-huge-performance-hits.html
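If it is the same transparent huge page story, a quick thing to check is
whether THP and its defrag/compaction are turned on (the sysfs path varies
by distro; RHEL 6 uses /sys/kernel/mm/redhat_transparent_hugepage):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

and a cheap experiment is to turn defrag off and rerun:

echo never > /sys/kernel/mm/transparent_hugepage/defrag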
Best,
j.
--
dr. james cuff, assistant dean for research computing, harvard university |
division of science | thirty eight oxford street, cambridge. ma. 02138 | +1
617 384 7647 | http://rc.fas.harvard.edu
On Thu, Jul 9, 2015 at 2:44 PM, mathog <mathog at caltech.edu> wrote:
> Reran the generators and that did make the system slow again, so at least
> this problem can be reproduced.
>
> After those ran, memory is definitely in short supply: pretty much
> everything is in file cache. For whatever reason, the system seems
> loath to release memory from file cache for other uses. I think that is
> the problem.
>
> Here is some data; this is a bit long...
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42
> 44 46
> node 0 size: 262098 MB
> node 0 free: 18372 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
> 45 47
> node 1 size: 262144 MB
> node 1 free: 2829 MB
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
>
> CPU-specific tests were done on CPU 20, so NUMA node 0. None of the tests
> comes close to using up all the physical memory in a "node", which is 262 GB.
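>
> (To make sure a rerun stays on that node's CPUs and memory, it can be
> pinned with numactl; "testprog" is just a placeholder name here:
>
> numactl --physcpubind=20 --membind=0 ./testprog )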
>
> When cache has been cleared, and the test programs run fast:
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 525988868 kB
> Buffers: 5428 kB
> Cached: 46544 kB
> SwapCached: 556 kB
> Active: 62220 kB
> Inactive: 121316 kB
> Active(anon): 26596 kB
> Inactive(anon): 109456 kB
> Active(file): 35624 kB
> Inactive(file): 11860 kB
>
> Run one test and it jumps up to:
>
> MemTotal: 529231456 kB
> MemFree: 491812500 kB
> Buffers: 10644 kB
> Cached: 34139976 kB
> SwapCached: 556 kB
> Active: 34152592 kB
> Inactive: 130400 kB
> Active(anon): 27560 kB
> Inactive(anon): 109316 kB
> Active(file): 34125032 kB
> Inactive(file): 21084 kB
>
> and the next test is still quick. After running the generators, but when
> nothing much is running, it starts like this:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 19606616 kB
> Buffers: 46704 kB
> Cached: 493107268 kB
> SwapCached: 556 kB
> Active: 34229020 kB
> Inactive: 459056372 kB
> Active(anon): 712 kB
> Inactive(anon): 135508 kB
> Active(file): 34228308 kB
> Inactive(file): 458920864 kB
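>
> (That Cached figure, 493107268 kB, is about 470 GB: roughly 93% of
> MemTotal is sitting in page cache.)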
>
> Then when a test job is run it drops quickly to this and sticks. Note the
> MemFree value. I think this is where the "events/20" kernel thread kicks in:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 691740 kB
> Buffers: 46768 kB
> Cached: 493056968 kB
> SwapCached: 556 kB
> Active: 53164328 kB
> Inactive: 459006232 kB
> Active(anon): 18936048 kB
> Inactive(anon): 135608 kB
> Active(file): 34228280 kB
> Inactive(file): 458870624 kB
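>
> (To watch this happen live, something along these lines works; the
> one-second interval is arbitrary:
>
> watch -n 1 'grep -E "^(MemFree|Cached|Active|Inactive)" /proc/meminfo' )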
>
> Kill the process and the system "recovers" to the preceding memory
> configuration in a few seconds. Similarly, here are /proc/zoneinfo values
> from before the generators were run, when the system was fast:
>
> extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10 -ifonly
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 50199731
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 50199731
> nr_inactive_anon 16490
> nr_active_anon 7191
> Node 1, zone Normal
> pages free 57596396
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 57596396
> nr_inactive_anon 10839
> nr_active_anon 1772
>
> and after the generators were run (slow):
>
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 23045
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 23045
> nr_inactive_anon 16486
> nr_active_anon 5839
> Node 1, zone Normal
> pages free 33726
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 33726
> nr_inactive_anon 10836
> nr_active_anon 1065
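>
> (The "extract" filter above is a local tool; a stock equivalent for
> grabbing each Node header plus the next 10 lines would be something like:
>
> grep -A 10 '^Node' /proc/zoneinfo )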
>
> Looking the same way at /proc/zoneinfo while a test is running showed
> the "pages free" and "nr_free_pages" values oscillating downward to
> a low of about 28000 for Node 0, zone Normal. The rest of the values were
> essentially stable.
>
> Looking the same way at /proc/meminfo while a test is running gave values
> that differed in only minor ways from the "after" table shown above.
> MemFree varied in a range from about 680000 to 720000 kB.
> Cached dropped to ~482407184 kB and then barely budged at all.
>
> Finally, the last few lines from "sar -B":
>
>                pgpgin/s  pgpgout/s   fault/s  majflt/s   pgfree/s  pgscank/s  pgscand/s  pgsteal/s   %vmeff
> 10:30:03 AM     5810.55  301475.26     95.99      0.05   51710.29   48086.79       0.00   48084.94   100.00
> 10:40:01 AM     3404.90  185502.87     96.67      0.01   47267.84   44816.30       0.00   44816.30   100.00
> 10:50:02 AM        9.13      13.32    192.24      0.11    4592.56      48.54    3149.01    3197.55   100.00
> 11:00:01 AM      191.78       9.97    347.56      0.13   16760.51       0.00    3683.21    3683.21   100.00
> 11:10:01 AM       11.64       7.75    342.59      0.09   18528.24       0.00    1699.66    1699.66   100.00
> 11:20:01 AM        0.00       6.75     96.87      0.00      43.97       0.00       0.00       0.00     0.00
>
> The generators finished at 10:35. At the 10:30 data point (while they were
> running) pgscank/s and pgsteal/s jumped from 0 to high values. When later
> tests were run the former fell back to almost nothing but the latter stayed
> high. Additionally, the test runs made after the generators pushed
> pgscand/s from 0 to several thousand per second. The last row covers a
> 10-minute span in which no tests were run, and these values all dropped
> back to zero.
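>
> (Those samples are the stock 10-minute sysstat cron intervals; for a
> finer-grained view during a run, sar can also be invoked directly, e.g.
>
> sar -B 10 30
>
> for thirty 10-second samples.)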
>
> Since excessive file cache seems to be implicated, I did this:
> echo 3 > /proc/sys/vm/drop_caches
>
> and reran the test on CPU 20. It was fast.
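>
> (Note that drop_caches only discards clean pages; if much of the cache
> were dirty one would want to flush it out first:
>
> sync; echo 3 > /proc/sys/vm/drop_caches )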
>
> I guess the question now is which parameters control the conversion of
> memory from file cache to memory needed for other purposes when free
> memory is in short supply and there is substantial demand. It seems the OS
> isn't releasing cache. Or maybe it isn't flushing dirty pages to disk. I
> don't think it's the latter, because iotop and iostat don't show any
> activity during a "slow" read.
>
> Thanks,
>
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>