[Beowulf] big read triggers migration and slow memory IO?

mathog mathog at caltech.edu
Thu Jul 9 11:44:09 PDT 2015


Reran the generators and that did make the system slow again, so at 
least this problem can be reproduced.

After those ran, memory is definitely in short supply: pretty much 
everything is in file cache.  For whatever reason, the system seems 
loath to release memory from file cache for other uses.  I think that 
is the problem.

Here is some data; this is a bit long...

numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 
42 44 46
node 0 size: 262098 MB
node 0 free: 18372 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 
43 45 47
node 1 size: 262144 MB
node 1 free: 2829 MB
node distances:
node   0   1
   0:  10  20
   1:  20  10

CPU-specific tests were done on CPU 20, which is in NUMA node 0.  None 
of the tests come close to using all the physical memory in a "node", 
which is 262 GB.
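
(In case anyone wants to reproduce the setup: pinning a run to core 20, 
and optionally to node 0 memory, can be done with something like the 
following; "testprog" is just a stand-in for the actual test binary.)

   numactl --physcpubind=20 --membind=0 ./testprog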

When cache has been cleared, and the test programs run fast:
cat /proc/meminfo | head -11
MemTotal:       529231456 kB
MemFree:        525988868 kB
Buffers:            5428 kB
Cached:            46544 kB
SwapCached:          556 kB
Active:            62220 kB
Inactive:         121316 kB
Active(anon):      26596 kB
Inactive(anon):   109456 kB
Active(file):      35624 kB
Inactive(file):    11860 kB

Run one test and it jumps up to:

MemTotal:       529231456 kB
MemFree:        491812500 kB
Buffers:           10644 kB
Cached:         34139976 kB
SwapCached:          556 kB
Active:         34152592 kB
Inactive:         130400 kB
Active(anon):      27560 kB
Inactive(anon):   109316 kB
Active(file):   34125032 kB
Inactive(file):    21084 kB

and the next test is still quick.  After running the generators, but 
when nothing much is running, it starts like this:

cat /proc/meminfo | head -11
MemTotal:       529231456 kB
MemFree:        19606616 kB
Buffers:           46704 kB
Cached:         493107268 kB
SwapCached:          556 kB
Active:         34229020 kB
Inactive:       459056372 kB
Active(anon):        712 kB
Inactive(anon):   135508 kB
Active(file):   34228308 kB
Inactive(file): 458920864 kB

Then, when a test job is run, it drops quickly to this and sticks.  Note 
the MemFree value.  I think this is where the "Events/20" process kicks 
in:

cat /proc/meminfo | head -11
MemTotal:       529231456 kB
MemFree:          691740 kB
Buffers:           46768 kB
Cached:         493056968 kB
SwapCached:          556 kB
Active:         53164328 kB
Inactive:       459006232 kB
Active(anon):   18936048 kB
Inactive(anon):   135608 kB
Active(file):   34228280 kB
Inactive(file): 458870624 kB
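
(The snapshots above are just repeated reads of /proc/meminfo; to watch 
the interesting fields continuously during a run, something like this 
works:)

   watch -n 1 'grep -E "^(MemFree|Cached|Active|Inactive)" /proc/meminfo'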

Kill the process and the system "recovers" to the preceding memory 
configuration in a few seconds.  Similarly, here are the /proc/zoneinfo 
values from before the generators were run, when the system was fast:

  extract -in state_zoneinfo_fast3.txt -if '^Node' -ifn 10  -ifonly
Node 0, zone      DMA
   pages free     3931
         min      0
         low      0
         high     0
         scanned  0
         spanned  4095
         present  3832
     nr_free_pages 3931
     nr_inactive_anon 0
     nr_active_anon 0
Node 0, zone    DMA32
   pages free     105973
         min      139
         low      173
         high     208
         scanned  0
         spanned  1044480
         present  822056
     nr_free_pages 105973
     nr_inactive_anon 0
     nr_active_anon 0
Node 0, zone   Normal
   pages free     50199731
         min      11122
         low      13902
         high     16683
         scanned  0
         spanned  66256896
         present  65351040
     nr_free_pages 50199731
     nr_inactive_anon 16490
     nr_active_anon 7191
Node 1, zone   Normal
   pages free     57596396
         min      11265
         low      14081
         high     16897
         scanned  0
         spanned  67108864
         present  66191360
     nr_free_pages 57596396
     nr_inactive_anon 10839
     nr_active_anon 1772

and after the generators were run (slow):

Node 0, zone      DMA
   pages free     3931
         min      0
         low      0
         high     0
         scanned  0
         spanned  4095
         present  3832
     nr_free_pages 3931
     nr_inactive_anon 0
     nr_active_anon 0
Node 0, zone    DMA32
   pages free     105973
         min      139
         low      173
         high     208
         scanned  0
         spanned  1044480
         present  822056
     nr_free_pages 105973
     nr_inactive_anon 0
     nr_active_anon 0
Node 0, zone   Normal
   pages free     23045
         min      11122
         low      13902
         high     16683
         scanned  0
         spanned  66256896
         present  65351040
     nr_free_pages 23045
     nr_inactive_anon 16486
     nr_active_anon 5839
Node 1, zone   Normal
   pages free     33726
         min      11265
         low      14081
         high     16897
         scanned  0
         spanned  67108864
         present  66191360
     nr_free_pages 33726
     nr_inactive_anon 10836
     nr_active_anon 1065

Looking the same way at /proc/zoneinfo while a test is running showed
the "pages free" and "nr_free_pages" values oscillating downward to
a low of about 28000 for Node 0, zone Normal.  The rest of the values 
were essentially stable.
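
(Something like the loop below is enough to watch just those counters; 
the awk field positions assume the exact /proc/zoneinfo layout shown 
above.)

   while sleep 5; do
       date '+%T'
       awk '/^Node/ {node=$2; sub(",","",node); zone=$4}
            /nr_free_pages/ {print "  node " node " zone " zone ": " $2 " pages"}' /proc/zoneinfo
   done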

Looking the same way at /proc/meminfo while a test is running gave 
values that differed in only minor ways from the "after" table shown 
above.  MemFree varied in a range from about 680000 to 720000 kB.
Cached dropped to ~482407184 kB and then barely budged at all.

Finally, the last few lines from "sar -B":

             pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
10:30:03 AM   5810.55 301475.26     95.99      0.05  51710.29  48086.79      0.00  48084.94    100.00
10:40:01 AM   3404.90 185502.87     96.67      0.01  47267.84  44816.30      0.00  44816.30    100.00
10:50:02 AM      9.13     13.32    192.24      0.11   4592.56     48.54   3149.01   3197.55    100.00
11:00:01 AM    191.78      9.97    347.56      0.13  16760.51      0.00   3683.21   3683.21    100.00
11:10:01 AM     11.64      7.75    342.59      0.09  18528.24      0.00   1699.66   1699.66    100.00
11:20:01 AM      0.00      6.75     96.87      0.00     43.97      0.00      0.00      0.00      0.00

The generators finished at 10:35.  In the 10:30 data point (while they 
were running) pgscank/s and pgsteal/s jumped from 0 to high values.  
When later tests were run, pgscank/s fell back to almost nothing but 
pgsteal/s stayed high.  Those post-generator test runs also pushed 
pgscand/s from 0 to several thousand per second.  The last row covers a 
10 minute span in which no tests were run, and all of these values 
dropped back to zero.
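
(The 10 minute spacing above is presumably the normal sysstat cron 
collection; for finer resolution during a single test, sar can also be 
run directly, e.g.:)

   sar -B 10 30    # paging statistics every 10 seconds, 30 samples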

Since excessive file cache seems to be implicated, I did this:
  echo 3 > /proc/sys/vm/drop_caches

and reran the test on CPU 20.  It was fast.
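
(One note for anyone trying this: as I understand it, drop_caches only 
releases clean cache, so it is usually preceded by a sync to write out 
any dirty pages first:)

  sync
  echo 3 > /proc/sys/vm/drop_caches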

I guess the question now is what parameter(s) control the reclamation 
of file cache memory for other uses when free memory is in short supply 
and there is substantial demand for it.  It seems the OS either isn't 
releasing the cache, or isn't flushing it to disk.  I don't think it's 
the latter, because iotop and iostat don't show any activity during a 
"slow" read.
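
For comparison, these are the VM tunables I can think of to look at; I 
don't know yet which, if any, is the culprit:

  sysctl vm.zone_reclaim_mode vm.min_free_kbytes vm.swappiness \
         vm.vfs_cache_pressure vm.dirty_ratio vm.dirty_background_ratio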

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

