# Super-scaling

Robert G. Brown rgb at phy.duke.edu
Fri Feb 21 06:07:29 PST 2003

On Thu, 20 Feb 2003, Jakob Oestergaard wrote:

> Still, going from 5% of the working-set in L2, to perhaps 20% of the
> working set in L2, may still provide you with a generous speedup.
>
> I don't think there are rules of thumb.  The factors and conditions I've
> summarized here are crude at best, but should give you an idea about what
> the scalability game is all about  :)
>
> Move the bulk of your working-set data up the hierarchy (from disk to
> RAM, from RAM to L2, ...), avoid communication where possible, adding
> more computational work to the overall problem may still aid the
> speedup, if it helps avoiding communication or memory consumption.
>
> Measure, think, experiment... Repeat as necessary.
>
> Everyone feel free to supplement, comment and correct  ;)

Actually, beautifully said -- I'll save this in my directory of
beautiful list answers for sure.

The only thing I'd add is a little chunk from the online book (in
latex, which is actually pretty easy to read as math).  The table
referred to is a table of approximate memory hierarchy speeds (as of a
few years ago -- I don't know if they are still current): L1 access in
about a nanosecond, L2 access in about 8 nanoseconds, main memory in
about 50 nanoseconds.  Then:

<quote>

Suppose $p_i$ is the probability that your program finds the next piece
of data or code it is looking for in the $i$th kind of ``memory'' in the
table above.  Then the average latency (the average time the system has
to wait before the data or code becomes available) is:
$$\langle L \rangle = \sum_i p_i L_i. \label{avg_latency}$$

For example, if $p_{L1} = 0.05$, $p_{L2} = 0.9$, and $p_{M} = 0.05$ for
a problem that fits into main memory, the average latency might be
something like:
$$\langle L \rangle = 0.05 \times 1 + 0.9 \times 8 + 0.05 \times 50 = 9.75$$
in nanoseconds.  Things get a little more complicated trying to
determine the overall data access rate (I'd rather not get into a full
discussion of EDO, SDRAM, Rambus, and so forth at this moment and
instead will just give you a web reference\footnote{See
http://pclt.cis.yale.edu/pclt/PCHW/CPUMEM.HTM}).  However, the bottom
line is that as long as the hierarchy of your hardware accomplishes this
efficiently (maintaining an average latency and bandwidth reasonably
close to that of the L1 and/or L2 cache) {\em for your code} there is no
reason to invest in a more expensive hierarchy.

</quote>

This equation gives you an idea of when one might see superlinear
speedup, and why seeing it is actually remarkably rare outside of
deliberate code reorganizations such as ATLAS.  In most programs, one
has fairly limited control over how data is blocked out and accessed.
I'd argue that MOST programs are written in e.g. C, C++, Fortran to be
readable and maintainable and optimized only at a coarse level (such as
unrolling loops).  Compilers are generally not microarchitecture aware
and have no idea what the sizes of e.g. the L1 and L2 caches are; nor
are they algorithm aware and do not reblock vector operations and
algorithms to optimize for the cache.  This is what makes ATLAS such a
fascinating, and extremely advanced, concept, one that is WAY ahead of
its time.

The "best" way to realize superlinear speedup in most programs (except
for those that swap to disk, since that's so slow that any split that
doesn't swap will speed up by orders of magnitude) is to use ATLAS for
all your linear algebra.  You would have to work very hard, and very
knowledgeably (in terms of the architecture of CPU, cache, memory) to
equal ATLAS's optimization efforts.  ATLAS actually dynamically measures
cache speeds and sizes in situ during the build process, running what
amounts to a search algorithm for the optimal blocking and algorithm set
(since it turns out that optimal algorithms themselves alter as a
function of program size and relative cache speed).  ATLAS will actually
change algorithm as necessary automagically, as a function of
vector/matrix size.  IIRC, it manages a speedup of 2-3 on a lot of
linear algebra code, which is a tremendous achievement (like turning
your 2 GHz CPU into a 6 GHz CPU, relative to running unoptimized).

If your code doesn't USE a lot of linear algebra (but still does lots of
loops over large data sets) then your second best approach is to try to
get a feel for the degree of memory locality and block things out to
maximize the probabilities in the latency equation above in the fastest
kinds of memory.  This too can lead to odd looking, non-intuitive code.
For example, when looping over sites in a lattice model in d dimensions,
neighboring sites in the lattice are typically NOT neighboring in
physical memory (where they are inevitably unrolled in a vector of some
sort).  The separation of neighboring sites in the vector is typically
on the order of L^(d-1) where d is the dimension and L the length of a
side -- one only if the lattice is 1 d (linear vector), L if 2d (square
lattice), L^2 for 3d (cubic lattice) and so forth.  If the L1 data cache
size is e.g. 32K, then AT MOST 4K of doubles can fit in it at a time.
The square root of 4096 (2^12) is 2^6 or 64.  Hence for cubic lattices
of size L=64, the stride between neighbors along the slowest-varying
direction is L^2 = 4096 doubles -- the entire cache -- and it is
borderline impossible to hold a lattice site AND ITS NEAREST NEIGHBORS
in L1 at the same time while looping over sites.

To "fix" this, one has to work very hard to break the big lattice up
into chunks that DO fit into L1 -- perhaps cube root of 4096 or 2^4 or L
= 16 on a side -- and loop over each subchunk.  Then one pays a
relatively heavy penalty for the surface atoms, which are now guaranteed
to be nonlocal in L1 (but which might be local in L2).  Further out,
there are boundaries that are guaranteed to be nonlocal in L2 for every
access, and so forth.  One picks up a surface to volume advantage,
however, over a cache hit per lattice site access -- (L-2)^3 sites are
guaranteed local (presuming that the entire block is loaded into L1 at
once), L^3 - (L-2)^3 (the surface sites) are guaranteed nonlocal.

Except that there is no such guarantee, as one has little control over
how L1 is emptied and filled, and L1 data will almost certainly contain
other variables besides the lattice.  And THEN there is L1 code, which
might be nonlocal as hell if you execute subroutine calls on the lattice
sites involving long jumps, locally allocated memory, other global
vectors, and so forth.  Ultimately it becomes VERY difficult to
theoretically predict exactly how cache is going to fill and empty in a
given code block, which is why ATLAS utilizes an empirical search for
the optimal blocking and algorithm in the neighborhoods of the expected
boundaries instead of just presuming that it can compute them exactly
even from a detailed knowledge of the microarchitecture.

This is precisely analogous to the process that would show superlinear
scaling on parallelized code, with the same tradeoffs, and with an
additional set of latencies (the network latencies) associated with
accessing "memory" on node A that resides physically on node B.  Lattice
decomposition can often be accomplished across a cluster with the same
surface to volume penalty on IPC's.  This favors running the BIGGEST
POSSIBLE sublattice that will fit in memory on each node, and favors
blocking that sublattice such that its sub-sublattices will fit in L2,
and its sub-sub-sublattices will fit in L1.

All of which will VERY LIKELY render the underlying code impossible for
humans to read and understand and maintain.  Especially if one is truly
compulsive and hand-optimizes the core loops in assembler for maximal
speed and control!  There has to be a very large economic payoff in
terms of speedup to be worth the cost of the human time, expertise,
energy required for this sort of optimization, and one has to be aware
that the resulting optimization isn't portable or general and will have
to be REdone to move your code from Intel to Athlon, from Celeron to P3
to P4, from SDRAM to DDR to RDR.  Generally there isn't any such
overwhelming payoff and generally portability is an important issue, so
generally it isn't done, or is done at the level of ATLAS, or perhaps is
done in very rough terms that provide one simple layer of moderately
portable L2 optimization (which can provide PROFOUND speedups, as noted
above) but not "perfectly" optimized.

One conclusion one can draw from the above is that EVEN WHERE
SUPERLINEAR SCALING IS EVIDENT on parallelized code, so Amdahl's law is
in fact violated, the large disparity between the longest possible real
memory latencies (tens of nanoseconds) and the shortest possible NETWORK
latencies (order of ten MICROseconds) means that one could almost
certainly eliminate the superlinear speedup resulting from splitting up
the code by optimizing the code for the local memory architecture FIRST.
If one can get a significant speedup by splitting the (non-swapping)
code between machines with a network in between, one can almost
certainly get an even GREATER speedup by blocking the code appropriately
on the SINGLE machine and paying a much smaller penalty across the
boundaries.

The second conclusion to draw from this is that if you REALLY want to
write the fastest possible code, you need to abandon writing a "simple
translation" of the underlying operational theory into code and take up
serious computer science as a vocation.  You should probably obtain
ATLAS and read its associated design documents and white papers.
Dongarra and co are VERY SMART HUMANS.  You might want to get a book or
two on algorithms and high performance computing and parallel computing.
You might want to learn assembler.

Be sure, however, to do the FULL cost-benefit analysis for doing so.
Remember, you're going to spend a lot of human time (expensive)
optimizing for a given architecture, and will have to REoptimize for
each new architecture that comes along.  The cost of micro-optimization
(where it cannot be scalably and universally done and encapsulated in
library calls a la ATLAS) is going to be large and paid repeatedly over
a long time.  The potential benefit is also large (factors of two or
more in speed in many cases) but it is RUN time (where you as a human
"wait" for results but can do lots of other useful things while you are
waiting like teach classes and write grants and have a life) versus YOUR
time writing, rewriting, tuning, tinkering with your code and quite
possibly breaking it and introducing Deep Bugs that are difficult to
root out in all the new complexity.

I'd say that most folks balk at this for very sound reasons.  Hence most
folks are content with the modest optimizations available at the
compiler level and perhaps an ATLAS layer for linear algebra dominated
code, and just live with the "probability" estimates of mean latency
like the one given oh-so-far above.

One of my personal fantasies, and a truly excellent project for a really
ambitious CPS graduate student, would be to create a numerical code
optimization engine and API between the kernel and applications
programmers.  For example, it would be VERY INTERESTING to abstract the
optimization steps of ATLAS into a matrix of parameters and create a
daemon or kernel module that generates those parameters on the first
boot of the kernel and periodically refines them, adding more parameters
as needed by other kinds of operational optimization.  Then it might be
possible to generate a variety of libraries that READ those parameters
when the code starts up and autotune.  Indeed, it might even be possible
to interface the compiler itself with the API and create an "O7" level
of optimization that actually reblocks dynamically in a way that
optimally utilizes the memory hierarchy.  Might or might not be
possible, but a very interesting concept...

rgb

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu