[Beowulf] Re: vectors vs. loops

Robert G. Brown rgb at phy.duke.edu
Wed Apr 27 10:15:42 PDT 2005


On Wed, 27 Apr 2005, Ben Mayer wrote:

> > However, most code doesn't vectorize too well (even, as you say, with
> > directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
> > possible -- faster than a desktop, sure, but using a multimillion dollar
> > machine to get a factor of MAYBE 10 in speedup compared to (at the time)
> > $5-10K machines.
> 
> What the people who run these centers have told me is that a
> supercomputer is worth the cost if you can get a speedup of 30x over
> serial. What do others think of this?

I personally think that there is no global answer to this question.
There is only cost-benefit analysis.  It is trivially simple to reduce
this assertion (by the people who run the centers, who are not exactly
unbiased here:-) to absurdity for many, many cases.  In either direction
-- for some it might be worth it for a factor of 2 in speedup, for
others it might NEVER be worth it at ANY speedup.

For example, nearly all common and commercial software isn't worth it at
any cost.  If your word processor ran 30x faster, could you tell?  Would
you care?  Would it be "worth" the considerable expense of rewriting it
for a supercomputer architecture to get a speedup that you could never
notice (presuming that one could actually speed it up)?  

Sure it's an obvious exception, but the problem with global answers is
that they brook no exceptions even when there are obvious ones.  If you
don't like the word processor example, pick a suitably rendered computer
game (zero productive value, but all sorts of speedup opportunities).
Pick any software with no particular VALUE in the return or with a low
OPPORTUNITY COST for the runtime required to run it.

A large number of HPC computations are in the latter category.  If I want
to run a simple simulation that takes eight hours on a serial machine
and that I plan to run a single time, is it worth it for me to spend a
month recoding it to run in parallel in five minutes?  Obviously not.
If you argue that I should include the porting time in the computation
of "speedup" then I'd argue that if I have a program that takes two
years to run without porting and that takes six months to port into a
form that runs on a supercomputer in six months more, well, a year of MY
life is worth it, depending on the actual COST of the "supercomputer"
time compared to the serial computer time.  Even in raw dollars, my
salary for the extra year is nontrivial compared to the cost of
purchasing and installing a brand-new cluster just to speed up the
computation by a measly factor of two or four, depending on how you
count.

So pay no attention to your supercomputer people's pronouncement.  That
number (or any other) is pulled out of, uh, their nether regions and is
unjustifiable.  Instead, do the cost-benefit analysis, problem by
problem, using the best possible estimates you can come up with for the
actual costs and benefits.

That very few people EVER actually DO this does not mean that it isn't
the way it should be done;-)
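
Just to make "do the arithmetic" concrete, here is a toy C sketch of the
time-to-solution comparison above.  Every number in it is invented; plug
in your own runtimes, salary, and hardware or CPU-hour costs:

/* Toy time-to-solution comparison.  Every number here is made up;
 * plug in your own estimates.  Compile with: cc -o tts tts.c
 */
#include <stdio.h>

int main(void)
{
    double serial_months    = 24.0;    /* runtime as-is on your desktop */
    double port_months      = 6.0;     /* your time spent parallelizing */
    double parallel_months  = 6.0;     /* runtime of the ported code    */
    double salary_per_month = 6000.0;  /* what your time is worth       */
    double machine_cost     = 30000.0; /* cluster or CPU-hour charges   */

    double serial_tts   = serial_months;
    double parallel_tts = port_months + parallel_months;
    double value_saved  = (serial_tts - parallel_tts) * salary_per_month;

    printf("time to solution: %.1f months serial, %.1f months ported\n",
           serial_tts, parallel_tts);
    printf("net benefit: %.0f - %.0f = %.0f dollars\n",
           value_saved, machine_cost, value_saved - machine_cost);
    return 0;
}

With the two-year example it comes out comfortably positive; rerun it
with the eight-hour, run-once numbers and the sign flips, which is the
whole point.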

> :) I needed to do some CHARMM runs this summer. The X1 did not like it
> much (neither did I, but when the code is making references to punch
> cards and you are trying to run it on a vector super, I think most
> would feel that way). I ended up running it in parallel by a method
> similar to yours. Worked great!

The easy way into cluster (or nowadays, "grid") computing, for sure.  If
your task is or can be run embarrassingly parallel, well, parallel
scaling doesn't generally get much better than a straight line of slope
one barring the VERY few problems that exhibit superlinear scaling for
some regime....;-)

> > If it IS a vector (or nontrivial parallel, or both) task, then the
> > problem almost by definition will EITHER require extensive "computer
> > science" level study -- work done with Ian Foster's book, Almasi and
> > Gottlieb for parallel and I don't know what for vector as it isn't my
> > area of need or expertise and Amazon isn't terribly helpful (most books
> > on vector processing deal with obsolete systems or are out of print, it
> > seems).
> 
> So what we should really be trying to do is matching code to the
> machine. One of the problems that I have run into is that unless one
> is at a large center there are only one or two machines that provide
> computing power. Where I am from we have an X1 and a T3E. Not a very
> good choice between the two. There should be a cluster coming up soon,
> which will give us the options that we need, i.e. vector or cluster.

No, what you SHOULD be doing is matching YOUR code to the cluster you
design and build just for that code.  With any luck, the cluster design
will be a generic and inexpensive one that can be reused (possibly with
minor reconfigurations) for a wide range of parallel problems.  If your
problem DOES trivially parallelize, nearly any grid/cluster of OTS
computers capable of holding it in memory, networked by (even)
sneakernet, will give you linear speedup.
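
For the trivially parallel case the "parallel programming" can be as
dumb as a round-robin split of independent runs across whatever boxes
you have.  A minimal sketch (the mysim binary and its --task flag are
hypothetical stand-ins for your own serial code):

/* Round-robin task farm sketch.  Run one copy per node with that
 * node's id; "mysim" and its --task flag are hypothetical stand-ins
 * for your own serial binary.
 * Usage: ./farm <node_id> <n_nodes> <n_tasks>
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s node_id n_nodes n_tasks\n", argv[0]);
        return 1;
    }
    int node_id = atoi(argv[1]);
    int n_nodes = atoi(argv[2]);
    int n_tasks = atoi(argv[3]);
    char cmd[256];

    /* Each node takes every n_nodes-th task.  No communication at all,
     * so the speedup is linear in nodes if the tasks are independent. */
    for (int task = node_id; task < n_tasks; task += n_nodes) {
        snprintf(cmd, sizeof(cmd),
                 "./mysim --task %d > out.%d", task, task);
        if (system(cmd) != 0)
            fprintf(stderr, "task %d failed on node %d\n", task, node_id);
    }
    return 0;
}

Start one copy per node, collect the out.* files afterwards however you
like (NFS, scp, sneakernet), and there is your linear speedup.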

Given Cluster World's Really Cheap Cluster as an example, you could
conceivably end up with a cluster design containing nodes that cost
between $250 and $1000 each, including switches and network and shelving
and everything, that can yield linear speedup on your code.  Then you do
your cost-benefit analysis: weigh your time, the value of the
computation, the value of owning your own hardware and being able to run
on it 24x7 without competition, and the value of being able to redirect
your hardware into other tasks when your main task is idle against any
additional costs (power and AC, maybe some systems administration,
maintenance).  This will usually tell you fairly accurately whether you
should build your own local cluster, run on a single desktop
workstation, or run on a supercomputer at some center, and it will even
tell you how many nodes you can/should buy and in what configuration to
get the greatest net benefit.
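
As a toy illustration only (every number below is invented), the same
sort of arithmetic can be looped over candidate node counts:

/* Toy node-count optimizer.  Assumes perfectly linear speedup, a flat
 * per-node price, and a dollar value per month of runtime saved; every
 * number below is invented.
 */
#include <stdio.h>

int main(void)
{
    double serial_months   = 12.0;    /* runtime on a single node       */
    double node_cost       = 750.0;   /* per node, network amortized in */
    double value_per_month = 5000.0;  /* value of a month saved         */
    double overhead        = 2000.0;  /* power, AC, admin for the run   */

    int    best_n   = 1;
    double best_net = -1e30;
    for (int n = 1; n <= 64; n++) {
        double runtime = serial_months / n;   /* linear speedup assumed */
        double benefit = (serial_months - runtime) * value_per_month;
        double cost    = n * node_cost + overhead;
        if (benefit - cost > best_net) {
            best_net = benefit - cost;
            best_n   = n;
        }
    }
    printf("best node count: %d (net benefit %.0f dollars)\n",
           best_n, best_net);
    return 0;
}

With those particular made-up numbers the net benefit peaks at around
nine nodes; with your real numbers it will peak somewhere else entirely,
which is exactly why you do the arithmetic yourself.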

Note that this process is still correct for people who have code that
WON'T run efficiently on really cheap node or network hardware; they
just have to work harder.  Either way, the most important work is
prototyping and benchmarking.  Know your hardware (possibilities) and
know your application.  Match up the two, paying attention to how much
everything costs and using real-world numbers everywhere you can.  AVOID
vendor-provided numbers, and look upon published benchmark numbers for
specific micro or macro benchmarks with deep suspicion unless you really
understand the benchmark and trust the source.  For example, you can
trust anything >>I<< tell you, of course...;-)

> The manual for the X1 provides some information and examples. Are the
> Apple G{3,4,5} the only processors who have real vector units? I have
> not really looked at SSE(2), but remember that they were not full
> precision.

What's a "real vector unit"?  On chip?  Off chip?  Add-on board?
Integrated with the memory and general purpose CPU (and hence
bottlenecked) how?

Nearly all CPUs have some degree of vectorization and parallelization
available on chip these days; they just tend to hide a lot of it from
you.  Compilers work hard to get that benefit out for you in general
purpose code, where you don't need to worry about whether or not the
unit is "real", only about how long it takes the system to do a STREAM
triad on a vector 10 MB long.  Code portability is a "benefit" and code
specialization is a "cost" when you work out the cost-benefit of making
things run on a "real vector unit".  I'd worry more about the times
returned by e.g. STREAM with nothing fancy done to tune it than about
how "real" the underlying vector architecture is.

Also, if your problem DOES trivially parallelize, remember that you have
to compare the costs and benefits of complete solutions, in place.  You
really have to benchmark the computation, fully optimized for the
architecture, on each possible architecture (including systems with
"just" SSE but perhaps with 64 bit memory architectures and ATLAS for
linear algebra that end up still being competitive) and then compare the
COST of those systems to see which one ends up being cheaper.  Remember
that bleeding edge systems often charge you a factor of two or more in
cost for a stinkin' 20% more performance, so that you're better off
buying two cheap systems rather than one really expensive one IF your
problem will scale linearly with number of nodes.

I personally really like the Opteron, and would commend it to people
looking for a very good general purpose floating point engine.  I would
mistrust vendor benchmarks that claim extreme speedups on vector
operations for any big code running out of main memory unless the MEMORY
is somehow really special.  A Ferrari runs as fast as a Geo on a crowded
city street.

As always, your best benchmark is your own application, in all its dirty
and possibly inefficiently coded state.  The vendor specs may show 30
GFLOPS (for just the right code running out of L1 cache or out of
on-chip registers) but when you hook that chip up to main memory with a
40 ns latency and some fixed bandwidth, it may slow right down to
bandwidth limited rates indistinguishable from those of a much slower
chip.
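
The arithmetic behind that slowdown fits in a few lines.  The 30 GFLOPS
is the vendor-style peak from above; the sustained bandwidth below is an
invented but plausible figure, so treat this strictly as a sketch:

/* Back-of-the-envelope memory-bandwidth ceiling for a triad-like
 * kernel.  The 30 GFLOPS peak is a vendor-style number; the sustained
 * bandwidth is an invented, merely plausible figure.
 */
#include <stdio.h>

int main(void)
{
    double peak_gflops    = 30.0;       /* code in cache/registers      */
    double sustained_gbs  = 6.4;        /* what main memory really feeds */
    double flops_per_byte = 2.0 / 24.0; /* triad: 2 flops per 24 bytes  */

    printf("peak: %.1f GFLOPS, memory-bound ceiling: %.2f GFLOPS\n",
           peak_gflops, sustained_gbs * flops_per_byte);
    return 0;
}

At 6.4 GB/s a triad-like loop tops out at roughly half a GFLOPS no
matter what the peak claims: the Ferrari stuck in city traffic.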

> > For me, I just revel in the Computer Age.  A decade ago, people were
> > predicting all sorts of problems breaking the GHz barrier.  Today CPUs
> > are routinely clocked at 3+ GHz, reaching for 4 and beyond.  A decade
> 
> I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
> 10/100 + intel 1000 Pro for $540 shipped. I was amazed.

The Opterons tend to go for about twice that per CPU, but they are FAST,
especially for their actual clock.  The AMD-64s can be picked up for
about the same price, and they too are fast.  I haven't really done a
complete benchmark run on the one I own so far, but they look
intermediate between the Opteron and everything else, at a much lower
price.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




