[Beowulf] big read triggers migration and slow memory IO?
James Cuff
james_cuff at harvard.edu
Thu Jul 9 11:54:26 PDT 2015
Wow - yeah, David, this sure is a doozy!
Super long shot...
http://blog.jcuff.net/2015/04/of-huge-pages-and-huge-performance-hits.html
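If it is the same transparent huge page story, a quick thing to check is
whether THP and its defrag/compaction are turned on (the sysfs path varies
by distro; RHEL 6 uses /sys/kernel/mm/redhat_transparent_hugepage):

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

and a cheap experiment is to turn defrag off and rerun:

echo never > /sys/kernel/mm/transparent_hugepage/defrag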
Best,
j.
--
dr. james cuff, assistant dean for research computing, harvard university |
division of science | thirty eight oxford street, cambridge. ma. 02138 | +1
617 384 7647 | http://rc.fas.harvard.edu
On Thu, Jul 9, 2015 at 2:44 PM, mathog <mathog at caltech.edu> wrote:
> Reran the generators and that did make the system slow again, so at least
> this problem can be reproduced.
>
> After those ran, memory is definitely in short supply: pretty much
> everything is in file cache. For whatever reason, the system seems
> loath to release memory from file cache for other uses. I think that is
> the problem.
>
> Here is some data; this is a bit long...
>
> numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42
> 44 46
> node 0 size: 262098 MB
> node 0 free: 18372 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
> 45 47
> node 1 size: 262144 MB
> node 1 free: 2829 MB
> node distances:
> node 0 1
> 0: 10 20
> 1: 20 10
>
> CPU-specific tests were done on CPU 20, so NUMA node 0. None of the tests
> comes close to using up all the physical memory in a "node", which is 262 GB.
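>
> (To make sure a rerun stays on that node's CPUs and memory, it can be
> pinned with numactl; "testprog" is just a placeholder name here:
>
> numactl --physcpubind=20 --membind=0 ./testprog )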
>
> When cache has been cleared, and the test programs run fast:
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 525988868 kB
> Buffers: 5428 kB
> Cached: 46544 kB
> SwapCached: 556 kB
> Active: 62220 kB
> Inactive: 121316 kB
> Active(anon): 26596 kB
> Inactive(anon): 109456 kB
> Active(file): 35624 kB
> Inactive(file): 11860 kB
>
> Run one test and it jumps up to:
>
> MemTotal: 529231456 kB
> MemFree: 491812500 kB
> Buffers: 10644 kB
> Cached: 34139976 kB
> SwapCached: 556 kB
> Active: 34152592 kB
> Inactive: 130400 kB
> Active(anon): 27560 kB
> Inactive(anon): 109316 kB
> Active(file): 34125032 kB
> Inactive(file): 21084 kB
>
> and the next test is still quick. After running the generators, but when
> nothing much is running, it starts like this:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 19606616 kB
> Buffers: 46704 kB
> Cached: 493107268 kB
> SwapCached: 556 kB
> Active: 34229020 kB
> Inactive: 459056372 kB
> Active(anon): 712 kB
> Inactive(anon): 135508 kB
> Active(file): 34228308 kB
> Inactive(file): 458920864 kB
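>
> (That Cached figure, 493107268 kB, is about 470 GB: roughly 93% of
> MemTotal is sitting in page cache.)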
>
> Then when a test job is run it drops quickly to this and sticks. Note the
> MemFree value. I think this is where the "events/20" kernel thread kicks in:
>
> cat /proc/meminfo | head -11
> MemTotal: 529231456 kB
> MemFree: 691740 kB
> Buffers: 46768 kB
> Cached: 493056968 kB
> SwapCached: 556 kB
> Active: 53164328 kB
> Inactive: 459006232 kB
> Active(anon): 18936048 kB
> Inactive(anon): 135608 kB
> Active(file): 34228280 kB
> Inactive(file): 458870624 kB
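>
> (To watch this happen live, something along these lines works; the
> one-second interval is arbitrary:
>
> watch -n 1 'grep -E "^(MemFree|Cached|Active|Inactive)" /proc/meminfo' )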
>
> Kill the process and the system "recovers" to the preceding memory
> configuration in a few seconds. Similarly, here are /proc/zoneinfo values
> from before the generators were run, when the system was fast:
>
> extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10 -ifonly
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 50199731
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 50199731
> nr_inactive_anon 16490
> nr_active_anon 7191
> Node 1, zone Normal
> pages free 57596396
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 57596396
> nr_inactive_anon 10839
> nr_active_anon 1772
>
> and after the generators were run (slow):
>
> Node 0, zone DMA
> pages free 3931
> min 0
> low 0
> high 0
> scanned 0
> spanned 4095
> present 3832
> nr_free_pages 3931
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone DMA32
> pages free 105973
> min 139
> low 173
> high 208
> scanned 0
> spanned 1044480
> present 822056
> nr_free_pages 105973
> nr_inactive_anon 0
> nr_active_anon 0
> Node 0, zone Normal
> pages free 23045
> min 11122
> low 13902
> high 16683
> scanned 0
> spanned 66256896
> present 65351040
> nr_free_pages 23045
> nr_inactive_anon 16486
> nr_active_anon 5839
> Node 1, zone Normal
> pages free 33726
> min 11265
> low 14081
> high 16897
> scanned 0
> spanned 67108864
> present 66191360
> nr_free_pages 33726
> nr_inactive_anon 10836
> nr_active_anon 1065
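>
> (The "extract" filter above is a local tool; a stock equivalent for
> grabbing each Node header plus the next 10 lines would be something like:
>
> grep -A 10 '^Node' /proc/zoneinfo )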
>
> Looking the same way at /proc/zoneinfo while a test is running showed
> the "pages free" and "nr_free_pages" values oscillating downward to
> a low of about 28000 for Node 0, zone Normal. The rest of the values were
> essentially stable.
>
> Looking the same way at /proc/meminfo while a test is running gave values
> that differed in only minor ways from the "after" table shown above.
> MemFree varied in a range from about 680000 to 720000 kB.
> Cached dropped to ~482407184 kB and then barely budged at all.
>
> Finally, the last few lines from "sar -B":
>
>                pgpgin/s  pgpgout/s   fault/s  majflt/s   pgfree/s  pgscank/s  pgscand/s  pgsteal/s   %vmeff
> 10:30:03 AM     5810.55  301475.26     95.99      0.05   51710.29   48086.79       0.00   48084.94   100.00
> 10:40:01 AM     3404.90  185502.87     96.67      0.01   47267.84   44816.30       0.00   44816.30   100.00
> 10:50:02 AM        9.13      13.32    192.24      0.11    4592.56      48.54    3149.01    3197.55   100.00
> 11:00:01 AM      191.78       9.97    347.56      0.13   16760.51       0.00    3683.21    3683.21   100.00
> 11:10:01 AM       11.64       7.75    342.59      0.09   18528.24       0.00    1699.66    1699.66   100.00
> 11:20:01 AM        0.00       6.75     96.87      0.00      43.97       0.00       0.00       0.00     0.00
>
> The generators finished at 10:35. At the 10:30 data point (while they were
> running) pgscank/s and pgsteal/s jumped from 0 to high values. When later
> tests were run the former fell back to almost nothing but the latter stayed
> high. Additionally, the test runs made after the generators pushed
> pgscand/s from 0 to several thousand per second. The last row covers a
> 10-minute span in which no tests were run, and these values all dropped
> back to zero.
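>
> (Those samples are the stock 10-minute sysstat cron intervals; for a
> finer-grained view during a run, sar can also be invoked directly, e.g.
>
> sar -B 10 30
>
> for thirty 10-second samples.)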
>
> Since excessive file cache seems to be implicated, I did this:
> echo 3 > /proc/sys/vm/drop_caches
>
> and reran the test on CPU 20. It was fast.
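>
> (Note that drop_caches only discards clean pages; if much of the cache
> were dirty one would want to flush it out first:
>
> sync; echo 3 > /proc/sys/vm/drop_caches )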
>
> I guess the question now is which parameters control the conversion of
> memory from file cache to memory needed for other purposes when free
> memory is in short supply and there is substantial demand. It seems the OS
> isn't releasing cache. Or maybe it isn't flushing dirty pages to disk. I
> don't think it's the latter, because iotop and iostat don't show any
> activity during a "slow" read.
>
> Thanks,
>
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>