[Beowulf] Re: vectors vs. loops

Robert G. Brown rgb at phy.duke.edu
Tue May 3 06:50:15 PDT 2005


On Tue, 3 May 2005, Joachim Worringen wrote:

> Robert G. Brown wrote:
> > However, most code doesn't vectorize too well (even, as you say, with
> > directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
> > possible -- faster than a desktop, sure, but using a multimillion dollar
> > machine to get a factor of MAYBE 10 in speedup compared to (at the time)
> > $5-10K machines.  In the meantime, I'm sure that there were people who
> > had code that DID vectorize well pulling their hair because of all those
> > 100 hour accounts that basically wasted 90% of the resource.
> 
> This general statement is just wrong. Many scientific codes *do* vectorize well, 
> and in this case, you do not get <10% of peak as you indicated, but typically 30 
> to 50% or even more (see i.e. Leonid Oliker's recent papers on this topic) for 
> the *overall application* (the vectorized loop alone is close to 100%) . This is 
> a significant difference.

Ah, an interesting discussion.

I don't think we are really disagreeing, except possibly in regard to
"many" vs "most".  I'm including, for example, many of the
bioinformatics applications, simulations of various sorts, applications
with a lot of decisioning and nonlocality in the code in the group that
doesn't.  Then there is a big block of scientific applications where
SOME parts of the main loop vectorize (with quite possibly a huge
speedup) but where the overall speedup is limited (as you note) to 30%
to 50%.  Finally, sure, there are plenty of applications out there that
are dominated by, e.g., simple floating point arithmetic and linear
algebra, and those folks know perfectly well who they are and quite
correctly lust after vector machines or machines with a really efficient
vector co-processor.  

>>I<< lust after machines with a really efficient vector co-processor --
as long as I don't have to pay 20x as much for them and/or spend six
months hacking my code so that they can use them for an eventual 30-50%
speedup.

Also note the context of the comment above -- I was referring to my own
experiences with the Cray Y-MP on the order of A DECADE AGO.  The Cray sped my
code up by close to a factor of 10 -- much more than "30-50%" --
compared to contemporary workstations (partly because the Cray's memory
and CPU were just plain faster even for purely scalar code). However, the
COST factor of the machine compared to the desktop workstation I was
using was close to 1000, I was getting a pitiful fraction of its
theoretical peak, it took me a big chunk of my 100 "free" hours to get
the code to where it sped up THAT much, and there were people who were
doing linear-algebra dominated work who had to give up those 100 hours
to people like me who shouldn't have been there "wasting" a precious and
expensive resource.

My experience was far from unique, and it is one of the things that
drove the cluster/beowulf revolution.  10 Sun workstations, even at a
cost of (say) $80-100K at the time, equalled the throughput of the Cray
for MY code, and I could run on them 24x7, and people could use them as
desktop computers at the same time I was using their spare CPU at
maximum niceness in the background.  It was no contest.  It still is --
far, far more code parallelizes well (or parallelizes AND vectorizes
well in a suitable problem partitioning) than merely vectorizes well.
This is why big, expensive, single-threaded vector supercomputers are
very, very rare compared to grids, beowulfs, NOWs, COWS, and even WOWS
(World of Warcraft clusters:-).  Even vector machines come in clusters,
do they not?

So I certainly wasn't trying to argue that vectorization isn't a
desirable thing to have available (ideally cheaply and transparently)
-- it IS available cheaply and transparently in most modern CPU designs
although only to a limited extent, and I think most of us would love to
see it even better supported.  Only that the full benefits of a "vector
machine" (supercomputer) were, and are today, a lot more limited and
difficult to obtain, and continue to have a significant cost premium
compared to OTS vanilla clusters with much better cost/performance
scaling for MOST scientific problems.  You can argue with that if you
like, but the marketplace speaks for itself, I think.  If "most"
scientific problems could be solved "most" cost-beneficially (in a TCO
sense) on vector hardware, "most" scientists making free choices of the
hardware to buy would be buying vector hardware... or Adam Smith's
invisible hand has gotten arthritic with age;-)

> Another 'legend' comes up in your statement: a single node of a vector machine 
> does not cost "multimillion" dollars. The price factor is quite close to the 
> performance factor for suitable applications.

As noted above and (I thought clearly) in the original comment, the
remark referred to my HISTORICAL EXPERIENCE WITH THE CRAY Y-MP (which
did indeed cost millions of dollars plus millions more a year to run).
Nowadays (as I also noted) vector discussions on this list involve
everything from clusters of playstations or OTS systems with add-on DSPs
or GPUs, through SSE instructions and pipelines on modern CPUs (which,
as I've been corrected offline, are currently a LOT deeper than 2-3
instructions) to dedicated vector systems.  Some of these are cheap but
require lots of hand work and code customization.  Others are
(relatively) expensive but commercially supported, in particular with
decent compilers, suitable memory, etc.  I frankly acknowledge my
inability to compute cost-benefit across this entire spectrum of
hardware possibilities matched to code, and freely agree that there are
very likely areas where vector machines homemade or commercial are a
good match.  I don't think I implied otherwise in my original note --
only that there are plenty of people, probably a majority of people, for
whom either alternative is a knee jerk bad match compared to sticking
with OTS hardware.
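
To be concrete about what "cheaply and transparently" buys you on OTS
hardware these days, the kind of loop the SSE pipelines and an
auto-vectorizing compiler (gcc's tree vectorizer, icc, and so on) handle
for free is something like the sketch below.  The function name and the
flags are illustrative, nothing canonical:

/* Unit stride, no branches, and the C99 restrict qualifiers promising
 * the compiler that the arrays don't overlap -- a friendly candidate
 * for packed SSE arithmetic. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
  int i;

  for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

Compile that with something like "gcc -O3 -msse2 -ftree-vectorize
-std=c99" on a gcc new enough to have the tree vectorizer (or with icc)
and you typically get packed SSE arithmetic for the whole loop with no
hand work and no code customization -- limited, but cheap and
transparent.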

Nevertheless, how can I (or anyone) argue with a statement ending with
"for suitable applications"?  So helpless to do otherwise, I agree.

Still, the marketplace speaks for itself.  It doesn't argue, and isn't
legendary, it just is.

> > My own code doesn't vectorize too well because it isn't heavy on linear
> > algebra and the loops are over 3 dimensional lattices where
> > nearest-neighbor sums CANNOT be local in a single memory stream and
> 
> Real vector architectures have very efficient scatter/gather memory operations, 
> and support indirect addressing efficiently as well.

The indirect addressing issue is one that is very important to me.  By
the time I ported my code (transiently) to the Cray I had given up
Fortran forever.  I didn't, and don't, and won't ever code in Fortran
any more.  So another bitch I >>had<< (note well, PAST TENSE) with the
Cray is that one could (apparently) only get really good performance
with Fortran code because the C compiler didn't do too terribly well
vectorizing loops filled with C pointers addressing dynamically
allocated and hand-repacked memory.  It could only properly manage
"traditional" allocations of fixed arrays and vectors, or so I was told
at the time when I asked why I wasn't getting even 100 MFLOPs on a 300
MFLOP architecture computed in terms of relative speedup of my code.
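
For the curious, the sort of "hand-repacked" allocation I mean looks
schematically like this (illustrative only, not my actual code; error
checking omitted):

#include <stdlib.h>

/* Allocate an L x L x L lattice as ONE contiguous block of doubles,
 * then build row and plane pointer tables so a site can be addressed
 * as lat[i][j][k].  Humans and cache lines are perfectly happy with
 * this, but what the old vectorizing C compilers saw at the point of
 * use was three levels of pointer indirection instead of a base
 * address plus compile-time strides. */
double ***alloc_lattice(int L)
{
  double   *block  = malloc((size_t) L * L * L * sizeof(*block));
  double  **rows   = malloc((size_t) L * L * sizeof(*rows));
  double ***planes = malloc((size_t) L * sizeof(*planes));
  int i, j;

  for (i = 0; i < L; i++) {
    planes[i] = rows + (size_t) i * L;
    for (j = 0; j < L; j++)
      planes[i][j] = block + ((size_t) i * L + j) * L;
  }
  return planes;
}

A fixed-size array (or a Fortran array) hands the compiler the base
address and the strides directly, which was apparently what it wanted.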

So I'm glad to hear that this is now fixed.  I'd be very interested in
seeing how my code runs on a modern vector architecture with a simple
recompile.  As I said, in order to achieve a change in the number of
steps per statistical sample from ~L^2 to ~L (a factor of 50-100 speedup
at the scales I study) I use a Monte Carlo process that basically
"grows" a cluster via an accept-reject process in addition to a simply
loop over cluster sites in lattice order.

In order to vectorize the simple loop (non-cluster) part, the processor
would have to track and preload the local neighbors of a site (easy),
local neighbors offset by a row (but generally on the same memory page,
so perhaps still easy), local neighbors offset by a plane (several pages
away in both directions) in a running fashion.  Inside the main loop the
processor has to do transcendental calls, division, and for some
algorithms a conditional, per site -- these seem to slow the
"computation" relative to memory access times per site to the point
where it is CPU bound on nearly all architectures I've run it on, giving
memory time to keep up even on OTS systems with relatively slow memory.
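
Schematically (and this is only a cartoon of the access pattern and the
per-site work, not the actual update rule), the non-cluster sweep looks
like:

#include <math.h>
#include <stdlib.h>

/* s is the flattened L*L*L lattice.  Neighbors of site n sit at
 * n +/- 1 (same row), n +/- L (adjacent rows), and n +/- L*L (adjacent
 * planes -- several pages away for large L).  drand48() stands in for
 * whatever uniform deviate generator is really used; the exp(), the
 * division and the conditional are the per-site work that keeps the
 * loop CPU bound.  Boundaries are ignored for simplicity. */
void sweep(double *s, int L)
{
  int LL = L * L, n;

  for (n = LL + L + 1; n < L * LL - LL - L - 1; n++) {
    double nbsum = s[n - 1]  + s[n + 1]     /* same row         */
                 + s[n - L]  + s[n + L]     /* rows above/below */
                 + s[n - LL] + s[n + LL];   /* adjacent planes  */
    double trial = nbsum / 6.0;             /* a division       */
    if (drand48() < exp(trial - s[n]))      /* a transcendental and a
                                               conditional per site */
      s[n] = trial;
  }
}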

The cluster part requires that memory locations that are offset by one
more level of indirection be constantly loaded.  I cannot advance the
core loop along the "easy" local direction, followed by the row-offset
direction, finally the plane-offset direction (or use a checkerboard
algorithm or anything else that improves vector efficiency).  The
cluster grows across rows or across planes from any given site tested as
often as it grows in the easy/efficient direction, and biasing the
computation otherwise (if it were even possible) would be VERY difficult
to program.
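
Again as a cartoon (boundaries and the real accept-reject physics
omitted; p_add and all the names are made up for illustration), the
cluster growth is driven by a stack of sites to be tested:

#include <stdlib.h>

/* Sites to test come off the stack in whatever order the cluster
 * happened to grow, so every neighbor load is one more level of
 * indirection away from any predictable stride -- the growth hops
 * across rows and planes as readily as along a row. */
void grow_cluster(int *stack, char *in_cluster, int L, int seed_site,
                  double p_add)
{
  int LL = L * L, top = 0, d;
  int offs[6];

  offs[0] =  1;  offs[1] = -1;
  offs[2] =  L;  offs[3] = -L;
  offs[4] = LL;  offs[5] = -LL;

  stack[top++] = seed_site;
  in_cluster[seed_site] = 1;

  while (top > 0) {
    int site = stack[--top];          /* indirect: next site to test    */
    for (d = 0; d < 6; d++) {
      int nb = site + offs[d];        /* may be rows or planes away     */
      if (!in_cluster[nb] && drand48() < p_add) {
        in_cluster[nb] = 1;           /* accept: add to growing cluster */
        stack[top++] = nb;
      }
    }
  }
}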

This is in a computation on a lattice, where one might EXPECT a possibly
significant vector speedup (and maybe on a modern vector architecture
I'd get one, who knows?).  I'd be happy if I could just run it on a
processor that could significantly increase the rate at which
transcendentals and division were accomplished, as so far I seem to be
less memory bound than processor bound on these steps.  And it may well
be that a vector processor DOES do transcendentals very efficiently, or
generate uniform deviates particularly efficiently, or do division
particularly efficiently (or just plain do floating point arithmetic
relatively efficiently) as vector processors tend to dedicate more
processor real estate to doing these things fast at the expense of
something else.  Any of these would speed up my code independent of the
memory speed, maybe to where it WAS memory bound.

This last bit is directly related to the point I was making in the
previous note.  To get optimal performance on ANY architecture you have
to study your code, and study the architecture, and maybe alter your
code for the architecture to take advantage of its features.  If you
just write a direct translation of the equations into code, you'll
likely end up with code that is easy to understand, easy to debug, easy
to test, but relatively slow (almost certainly not optimally fast).  The
more you mess with reordering and blocking loops, "parenthesizing" steps
in the core computation, and so forth, the faster it runs (for a given
target architecture) but the more difficult it gets to understand the
code, debug the code, validate the code, and when one changes
architectures one has to do all of the above all over again.  This is
the "law" of code tuning that Joseph (IIRC) cited a bit earlier in the
thread.
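
The canonical example of this law is cache blocking.  The naive matrix
multiply below is a line-for-line transcription of the definition; the
blocked version runs much faster once N outgrows cache, is
correspondingly harder to read and validate, and its best block size B
has to be retuned for every new CPU:

/* Naive: transparent, easy to check against the math. */
void mm_naive(int N, const double *a, const double *b, double *c)
{
  int i, j, k;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c[i*N + j] += a[i*N + k] * b[k*N + j];
}

/* Blocked: same arithmetic, reordered so that B x B tiles stay in
 * cache.  B is an architecture-dependent tuning knob. */
#define B 32

void mm_blocked(int N, const double *a, const double *b, double *c)
{
  int ii, jj, kk, i, j, k;

  for (ii = 0; ii < N; ii += B)
    for (jj = 0; jj < N; jj += B)
      for (kk = 0; kk < N; kk += B)
        for (i = ii; i < ii + B && i < N; i++)
          for (j = jj; j < jj + B && j < N; j++)
            for (k = kk; k < kk + B && k < N; k++)
              c[i*N + j] += a[i*N + k] * b[k*N + j];
}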

Many of us who do scientific work choose to keep our code "close" to a
fairly direct implementation of the underlying theory because otherwise
over TIME we find ourselves constantly rewriting and retuning and
retesting the code for new architectures as old ones fade away, and this
ends up being a significant expense in a code that might take a year or
more to develop, debug, validate in the first place and upon which you
wish to base publications on which your professional reputation rests.
"Safe" can sometimes be worthwhile at the expense of "slow".  One very
appealing thing about cluster architectures is that one can, with MPI or
PVM, write code that is STILL pretty safe and get near-linear speedup
across a wide range of "generic" architectures without having to
necessarily completely rewrite and retune and retest your code every
time a new processor or network or system is released.  When I started
running on 64 bit Opterons (relative to 32 bit AMD or Intel CPUs), I did
so by means of a simple "make".  Do I get all of the benefit of the
architecture?  I doubt it, but I get a lot, and I don't have to
instrument my code with different versions for the 32 bit and 64 bit
CPUs.  And it is obviously easy to validate.  As a final benefit, by
keeping the code generic I can take advantage not only of MY nodes but
of ANYBODY'S nodes -- in a grid or extended cluster environment I can
suck up otherwise idle resources and put them to good use.
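
A minimal sketch of what I mean by "safe" MPI code is below --
do_my_samples() is a made-up stand-in for the real per-node work, but
note that nothing in the harness knows or cares what CPU it runs on:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Dummy stand-in so the sketch compiles; the real Monte Carlo work
 * would go here. */
double do_my_samples(long nsamples, int seed)
{
  double sum = 0.0;
  long i;

  srand48(seed);
  for (i = 0; i < nsamples; i++)
    sum += drand48();
  return sum;
}

int main(int argc, char **argv)
{
  int rank, size;
  long nsamples = 10000000;
  double mysum, total;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* every node runs the same generic C on its share of the samples */
  mysum = do_my_samples(nsamples / size, rank);

  /* one collective call combines the answers */
  MPI_Reduce(&mysum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    printf("total = %g\n", total);

  MPI_Finalize();
  return 0;
}

mpicc it, mpirun it on 4 nodes or 400, rebuild it on Opterons with a
plain "make" -- the same source gets its near-linear speedup (for this
embarrassingly parallel kind of work) everywhere.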

Would this be true for a vector system?  Could I just drop my generic,
repacked-pointer laden C-based code filled with divisions and
transcendental calls onto a vector system, type in "make", and have it
build into efficient code with a really significant speedup?  Would the
speedup justify the additional cost of the vector system (presuming that
it would cost more at all in TCO) given that I can already double my
effective speed by just doubling the number of nodes at any time I can
afford them?  That is, if a vector node costs 2x as much as an Opteron,
it has to run the code at least 2x as fast to be worth it to me as
otherwise I'm better off buying the second Opteron.

I don't know, but I'm very interested in the issue.  I'm especially
interested in the possibility of writing (what's the acronym?) SANA --
self adaptive numerical algorithms -- that by one means or another learn
about the architecture on which they are running and then automagically
retune for the architecture, so one can in principle write code that IS
portable and yet at least somewhat optimized per architecture.  ATLAS
stands as a shining beacon pointing out the way here, but the field is
in its infancy as far as I can tell (after a conversation with Jack
Dongarra a few weeks ago) with only a few people doing active research
in it.  This would still require instrumentation of the code, but the
instrumentation would be more of the portable and
non-architecture-specific kind, e.g. 

if(l2cachesize <= 256 && memory_latency > 40){  /* say, KB and ns */
  usesmallcachealgorithm();
} else {
  usebigcachealgorithm();
}

sort of thing, instead of

#ifdef CELERON
...

or

#ifdef INTEL_XEON_P4
...

or anything else that is an obviously architecture-specific
instrumentation that will one day become mere cruft in your code, or the
dynamical-per-architecture build and tune of ATLAS.
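
Rolling a crude runtime probe for something like memory_latency isn't
even hard.  The fragment below (helper name made up, clock() used only
because it's portable) chases a randomly permuted pointer chain through
a buffer much larger than cache and times the dependent loads;
ATLAS-class searches are far more careful than this, but it's the
flavor of the thing:

#include <stdlib.h>
#include <time.h>

static volatile size_t sink;   /* keeps the chase from being optimized away */

/* Rough nanoseconds per dependent load through `bytes' of memory. */
double estimate_memory_latency_ns(size_t bytes)
{
  size_t n = bytes / sizeof(size_t), i, j, tmp;
  size_t *chain = malloc(n * sizeof(*chain));
  clock_t t0, t1;

  if (chain == NULL) return -1.0;

  /* Sattolo shuffle: one big random cycle, so hardware prefetch and
     caches get no help from regular strides. */
  for (i = 0; i < n; i++)
    chain[i] = i;
  for (i = n - 1; i > 0; i--) {
    j = (size_t)(drand48() * i);
    tmp = chain[i]; chain[i] = chain[j]; chain[j] = tmp;
  }

  t0 = clock();
  for (i = 0, j = 0; i < n; i++)
    j = chain[j];                     /* each load depends on the last */
  t1 = clock();
  sink = j;

  free(chain);
  return 1e9 * (double) (t1 - t0) / CLOCKS_PER_SEC / (double) n;
}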

   rgb

> 
>   Joachim
> 
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




