jakob at unthought.net
Thu Feb 20 14:56:14 PST 2003
On Wed, Feb 19, 2003 at 01:30:22PM +0000, Simon Hogg wrote:
> Is there a rough rule of thumb which dictates when a program (if ever)
> shows superscaling with number of nodes. Of course, I would not expect
> this to carry on ad infinitum, but does anyone see superscalar behaviour up
> to a certain number of nodes?
> What would be the conditions for this to occur?
When splitting up a sequential problem to run on multiple processors,
various conditions will determine whether the scalability is sub-linear,
linear or super-linear:
This works against scalability:
*) Required communication between the sub-problem solvers (your nodes)
*) Non-uniform breakdown of the problem (some nodes will wait for other
nodes that take longer to complete their sub-problem)
This works for scalability:
*) By adding nodes, sub-problems are solved in parallel
*) Sub-problems are smaller than the full problem, and can thus be
   solved with a smaller working set per node
It is the last condition that can result in super-linear speedup. Since
modern computers are *not* Von Neumann machines, solving two half
problems may be significantly faster than solving one full problem.
More specifically: executing an instruction takes some amount of time,
but getting the data input required for executing the instruction is
often what limits your performance.
Modern machines have a hierarchical memory configuration - you have
registers, L1 cache, L2 cache, sometimes L3 cache, RAM, disk cache, and
finally disk.
By splitting up your problem, you may be able to use less disk access
and more RAM access. Or more L2 cache and less RAM access.
L2 is many times slower than L1. RAM is many times slower than L2. Disk
is many orders of magnitude slower than RAM.
If, by splitting up your problem, your machines can keep the entire
working set in RAM instead of 50% in RAM and 50% on disk (swap, scratch
files, etc.) - you will see a *massive* speedup. Thus, solving two
sub-problems may be much faster than solving the full problem in one go.
Of course, splitting up a problem into sub-problems may require more
work (counted in number of instructions). It may require communication,
and it may result in non-uniform balancing of load. Many things work
against you.
I have parallelized a handful of existing simulation programs, and only
seen super-linear speedup once. It's not terribly uncommon, I think, but
you cannot assume that any given problem will exhibit this behaviour.
And for many problems, there is just no way you can ever achieve it.
The disk-bound problems are the easiest ones, I think, because getting
rid of the disk access will buy you several orders of magnitude in
speed. The overall problem is forgiving: even introducing five times as
much computational work in order to split up the problem may still buy
you super-linear speedup.
Getting things to fit in L2 is a lot harder, and I doubt that you will
be able to parallelize many real-world problems to the extent where each
sub-problem's working set will fit in L2. L2 usage can be optimized
using packages such as Atlas, which provides blocked linear algebra
routines, for instance.
Still, going from 5% of the working set in L2 to perhaps 20% of the
working set in L2 may provide you with a generous speedup.
I don't think there are rules of thumb. The factors and conditions I've
summarized here are crude at best, but they should give you an idea of
what the scalability game is all about :)
Move the bulk of your working-set data up the hierarchy (from disk to
RAM, from RAM to L2, ...), and avoid communication where possible.
Adding more computational work to the overall problem may still aid the
speedup, if it helps you avoid communication or memory consumption.
Measure, think, experiment... Repeat as necessary.
Everyone feel free to supplement, comment and correct ;)
: jakob at unthought.net : And I see the elder races, :
:.........................: putrid forms of man :
: Jakob Østergaard : See him rise and claim the earth, :
: OZ9ABN : his downfall is at hand. :