[Beowulf] Re: Feedback on large pages in Linux (hahn at physics.mcmaster.ca)

Wed Jul 26 15:55:29 PDT 2006

On Wed, 26 Jul 2006, Kevin Pedretti wrote:

>
>>> below).  So, an app that accesses lots of little regions of memory
>>> scattered all over the place will probably be hurt by using large
>>> pages.
>>
>> I find that statement a bit misleading; consider a case where I'm
>> iterating through a 16M region, touching 1 word at 4k strides.
>> 8x2M pages will be golden, whereas small pages would thrash badly.
>
> My statement was too general.  Thanks for clearing it up with the two
> examples (this one and in your other reply).
>
> There are real apps that perform better with small pages on XT3.  This
> is with the exact same *physical* memory layout, only difference being
> the page size used.  I wouldn't expect this to happen if there were the
> same number of 2-Mbyte and 4-Kbyte TLB entries.

Isn't there a fairly general rule of thumb that systems which get a
large benefit from streaming I/O through a cache of any size pay a
penalty, in terms of repeated flushes, for sufficiently NON-streaming
(random or variable stride) I/O?  That is, it takes a lot longer to
flush and reload larger pages than smaller ones, so if you're bouncing
around a lot in your memory references you are thrashing in any event,
but the COST of thrashing is higher for larger caches than for small
ones?

I've got a lovely little benchmark in benchmaster (was cpu_rate) that
demonstrates this fairly graphically.  Basically you can run a simple
RAM R/W/RW test one of two ways: streaming with a selected stride
(default 1), or with a clever little algorithm for its addressing that
preloads each vector element you read with the address of the next data
element to be read.  You can then either shuffle those addresses from a
large list, giving an effectively random stride forward or backward, or
step neatly through the vector.
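
To make the addressing trick concrete, here is a rough sketch of the
idea in C.  It is NOT the benchmaster source; the names build_chain and
chase are made up for illustration, and it stores array indices rather
than raw addresses.  Each slot holds the index of the next slot to
read, so every load depends on the one before it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill next[] so that following next[i] visits all n slots exactly once
 * and then returns to the start.
 *   shuffle == 0: next[i] = (i + 1) % n, i.e. a streaming unit-stride walk.
 *   shuffle != 0: the visiting order is a random permutation.
 * Hypothetical helper for illustration, not the benchmaster code. */
void build_chain(size_t *next, size_t n, int shuffle, unsigned seed)
{
    assert(n > 0);
    size_t *order = malloc(n * sizeof *order);
    assert(order != NULL);

    for (size_t i = 0; i < n; i++)
        order[i] = i;

    if (shuffle) {
        /* Fisher-Yates shuffle.  rand() is only a placeholder: it has
         * modulo bias and RAND_MAX may be small, so a serious test of
         * a big vector would want a better generator. */
        srand(seed);
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
    }

    /* Link each slot to its successor in the visiting order. */
    for (size_t i = 0; i < n; i++)
        next[order[i]] = order[(i + 1) % n];

    free(order);
}

/* Follow the chain for `steps` dependent loads, starting at `start`.
 * Returning the final index keeps the compiler from discarding the loop. */
size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    for (size_t s = 0; s < steps; s++)
        p = next[p];
    return p;
}
```

To compare the two modes you would build one chain with shuffle off and
one with shuffle on over a vector much larger than cache, then time a
few million steps of chase() on each (with clock_gettime() or similar).
Because the chain is a single cycle, n steps from any slot land back on
that slot, which makes a handy sanity check.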

There is a REALLY REALLY BIG difference in the rates, as one might
expect.  I haven't verified that it is BIGGER with larger pages because
I can't, but I'll bet that it is.  It might not vary negatively with the
actual size of the cache, though (as noted above), because naturally a
16 MB cache can hold a pretty big chunk -- even all -- of MANY streaming
lists one might test, and once in cache the relative penalty of a
backwards jump or a jump with random stride is somewhat reduced
(remembering that there is often still a much smaller L1 data cache
that may need to be flushed as you bounce around).

Or is this all crazy?  Greg?  You seem to be the resident expert on
low-level CPU hardware these days (working for a compiler company, after
all;-)...

     rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu