[Beowulf] Re: vectors vs. loops

Art Edwards edwardsa at afrl.kirtland.af.mil
Wed Apr 27 11:52:49 PDT 2005

This subject is pretty important to us. We run codes where the
bottleneck is eigensolving for matrices with a few thousand elements.
Parallel eigensolvers are not impressive at this scale. In the dark
past, I did a benchmark on a Cray Y-MP using a vector eigensolver and
got over 100x speedup. What I don't know is how this would compare to
current compilers and CPUs. However, the vector pipes are not very deep
on any of the current processors except, possibly, the PPC. So, I would
like to see benchmarks of electronic structure codes that are bound by
eigensolvers on a "true vector" machine.

Art Edwards

On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
> On Wed, 27 Apr 2005, Ben Mayer wrote:
> > > However, most code doesn't vectorize too well (even, as you say, with
> > > directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
> > > possible -- faster than a desktop, sure, but using a multimillion dollar
> > > machine to get a factor of MAYBE 10 in speedup compared to (at the time)
> > > $5-10K machines.
> > 
> > The people who run these centers have told me that a
> > supercomputer is worth the cost if you can get a speedup of 30x over
> > serial. What do others think of this?
> I personally think that there is no global answer to this question.
> There is only cost-benefit analysis.  It is trivially simple to reduce
> this assertion (by the people who run the centers, who are not exactly
> unbiased here:-) to absurdity for many, many cases.  In either direction
> -- for some it might be worth it for a factor of 2 in speedup, for
> others it might NEVER be worth it at ANY speedup.
> For example, nearly all common and commercial software isn't worth it at
> any cost.  If your word processor ran 30x faster, could you tell?  Would
> you care?  Would it be "worth" the considerable expense of rewriting it
> for a supercomputer architecture to get a speedup that you could never
> notice (presuming that one could actually speed it up)?  
> Sure it's an obvious exception, but the problem with global answers is
> they brook no exceptions even when there are obvious ones.  If you don't
> like word processor, pick a suitable rendered computer game (zero
> productive value, but all sorts of speedup opportunities).  Pick any
> software with no particular VALUE in the return or with a low
> OPPORTUNITY COST of the runtime required to run it.
> A large number of HPC computations are in the latter category.  If I want
> to run a simple simulation that takes eight hours on a serial machine
> and that I plan to run a single time, is it worth it for me to spend a
> month recoding it to run in parallel in five minutes?  Obviously not.
> If you argue that I should include the porting time in the computation
> of "speedup" then I'd argue that if I have a program that takes two
> years to run without porting and that takes six months to port into a
> form that runs on a supercomputer in six months more, well, a year of MY
> life is worth it, depending on the actual COST of the "supercomputer"
> time compared to the serial computer time.  Even in raw dollars, my
> salary for the extra year is nontrivial compared to the cost of
> purchasing and installing a brand-new cluster just to speed up the
> computation by a measly factor of two or four, depending on how you
> count.
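
The arithmetic in the two examples above can be sketched in a few lines
of Python. This is only an illustration: the function names are made up
for this note, and the 160-hour "month" of recoding is an assumption
(roughly four 40-hour weeks), not a number from the thread.

```python
import math

def time_to_result(port_hours, run_hours, runs=1):
    """Total time: one-time porting effort plus all the runs."""
    return port_hours + runs * run_hours

def break_even_runs(port_hours, serial_run_hours, parallel_run_hours):
    """Smallest number of runs at which porting wins on total time.

    Returns None if the ported code is no faster per run, in which
    case the port never pays for itself.
    """
    saved_per_run = serial_run_hours - parallel_run_hours
    if saved_per_run <= 0:
        return None
    # Need port_hours < runs * saved_per_run, with runs an integer.
    return math.floor(port_hours / saved_per_run) + 1

# Eight-hour serial job vs. five minutes in parallel, after an ASSUMED
# 160 hours of recoding.
print(break_even_runs(160, 8, 5 / 60))   # 21

# Two-year job (units are months here): 6 months to port + 6 to run.
serial_total = time_to_result(0, 24)
ported_total = time_to_result(6, 6)
print(24 / 6)                       # 4.0: speedup counting runtime only
print(serial_total / ported_total)  # 2.0: speedup counting the port too
```

With these numbers the one-off eight-hour job is nowhere near worth
porting (you would need 21 runs to break even), while the two-year job
gives the "two or four, depending on how you count" from the text.
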
> So pay no attention to your supercomputer people's pronouncement.  That
> number (or any other) is pulled out of, uh, their nether regions and is
> unjustifiable.  Instead, do the cost-benefit analysis, problem by
> problem, using the best possible estimates you can come up with for the
> actual costs and benefits.
> That very few people EVER actually DO this does not mean that it isn't
> the way it should be done;-)
> > :) I needed to do some CHARMM runs this summer. The X1 did not like it
> > much (neither did I, but when the code is making references to punch
> > cards and you are trying to run it on a vector super, I think most
> > would feel that way), I ended up running it in parallel by a similar
> > method as yours. Worked great!
> The easy way into cluster (or nowadays, "grid") computing, for sure.  If
> your task is or can be run embarrassingly parallel, well, parallel
> scaling doesn't generally get much better than a straight line of slope
> one, barring the VERY few problems that exhibit superlinear scaling for
> some regime....;-)
> > > If it IS a vector (or nontrivial parallel, or both) task, then the
> > > problem almost by definition will EITHER require extensive "computer
> > > science" level study -- work done with Ian Foster's book, Almasi and
> > > Gottlieb for parallel and I don't know what for vector as it isn't my
> > > area of need or expertise and Amazon isn't terribly helpful (most books
> > > on vector processing deal with obsolete systems or are out of print, it
> > > seems).
> > 
> > So what we should really be trying to do is matching code to the
> > machine. One of the problems that I have run into is that unless one
> > is at a large center there are only one or two machines that provide
> > computing power. Where I am from we have an X1 and a T3E. Not a very
> > good choice between the two. There should be a cluster coming up soon,
> > which will give us the options that we need, i.e., vector or cluster.
> No, what you SHOULD be doing is matching YOUR code to the cluster you
> design and build just for that code.  With any luck, the cluster design
> will be a generic and inexpensive one that can be reused (possibly with
> minor reconfigurations) for a wide range of parallel problems.  If your
> problem DOES trivially parallelize, nearly any grid/cluster of OTS
> computers capable of holding it in memory on (even) sneakernet will give
> you linear speedup.  
> Given Cluster World's Really Cheap Cluster as an example, you could
> conceivably end up with a cluster design containing nodes that cost
> between $250 and $1000 each, including switches and network and shelving
> and everything, that can yield linear speedup on your code.  Then you do
> your cost-benefit analysis, trade off your time, the value of the
> computation, the value of owning your own hardware and being able to run
> on it 24x7 without competition, the value of being able to redirect your
> hardware into other tasks when your main task is idle, any additional
> costs (power and AC, maybe some systems administration, maintenance).
> This will usually tell you fairly accurately both whether you should
> build your own local cluster vs run on a single desktop workstation vs
> run on a supercomputer at some center and will even tell you how many
> nodes you can/should buy and in what configuration to get the greatest
> net benefit.
> Note that this process is still correct for people who have code that
> WON'T run efficiently on really cheap node or network hardware; they
> just have to work harder.  Either way, the most important work is
> prototyping and benchmarking.  Know your hardware (possibilities) and
> know your application.  Match up the two, paying attention to how much
> everything costs and using real world numbers everywhere you can.  AVOID
> vendor provided numbers, and look upon published benchmark numbers for
> specific micro or macro benchmarks with deep suspicion unless you really
> understand the benchmark and trust the source.  For example, you can
> trust anything >>I<< tell you, of course...;-)
> > The manual for the X1 provides some information and examples. Are the
> > Apple G{3,4,5} the only processors that have real vector units? I have
> > not really looked at SSE(2), but I remember that they were not full
> > precision.
> What's a "real vector unit"?  On chip?  Off chip?  Add-on board?
> Integrated with the memory and general purpose CPU (and hence
> bottlenecked) how?
> Nearly all CPUs have some degree of vectorization and parallelization
> available on chip these days; they just tend to hide a lot of it from
> you.  Compilers work hard to get that benefit out for you in general
> purpose code, where you don't need to worry about whether or not the
> unit is "real", only about how long it takes the system to do a stream
> triad on a vector 10 MB long.  Code portability is a "benefit" and code
> specialization is a "cost" when you work out the cost-benefit of making
> things run on a "real vector unit".  I'd worry more about the times
> returned by e.g. stream with nothing fancy done to tune it than how
> "real" the underlying vector architecture is.
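
For anyone who has not seen it, the triad kernel mentioned above is
a[i] = b[i] + s*c[i]. Below is a minimal sketch of timing it in
Python, assuming double-precision vectors of 10 MB each. The names
triad and time_triad are just for this note. Pure Python is dominated
by interpreter overhead, so the number it prints says nothing about
memory bandwidth; the real STREAM benchmark is C/Fortran, and this only
shows the shape of the kernel and how the bytes are counted.

```python
import time
from array import array

def triad(a, b, c, scalar):
    """STREAM-style triad: a[i] = b[i] + scalar * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def time_triad(n, scalar=3.0, repeats=3):
    """Return (best_seconds, bytes_moved) for a triad over n doubles."""
    a = array("d", [0.0]) * n
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, scalar)
        best = min(best, time.perf_counter() - start)
    # Sanity check on the result before trusting the timing.
    assert a[0] == a[-1] == 1.0 + scalar * 2.0
    # STREAM counts two reads and one write per element.
    bytes_moved = 3 * n * a.itemsize
    return best, bytes_moved

if __name__ == "__main__":
    n = 10 * 1024 * 1024 // 8   # each vector holds 10 MB of doubles
    seconds, nbytes = time_triad(n)
    print(f"{nbytes / seconds / 1e6:.1f} MB/s (interpreter-bound)")
```
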
> Also, if your problem DOES trivially parallelize, remember that you have
> to compare the costs and benefits of complete solutions, in place.  You
> really have to benchmark the computation, fully optimized for the
> architecture, on each possible architecture (including systems with
> "just" SSE but perhaps with 64 bit memory architectures and ATLAS for
> linear algebra that end up still being competitive) and then compare the
> COST of those systems to see which one ends up being cheaper.  Remember
> that bleeding edge systems often charge you a factor of two or more in
> cost for a stinkin' 20% more performance, so that you're better off
> buying two cheap systems rather than one really expensive one IF your
> problem will scale linearly with number of nodes.
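
The "two cheap systems vs. one expensive one" comparison is easy to put
numbers on. A rough sketch, assuming linear scaling with node count as
stated above; the dollar figures and the function name are hypothetical,
and only the "2x the cost for 20% more performance" ratio comes from the
text.

```python
def throughput_for_budget(budget, node_cost, node_perf):
    """Whole nodes the budget buys and their aggregate performance,
    ASSUMING the problem scales linearly with the number of nodes."""
    nodes = int(budget // node_cost)
    return nodes, nodes * node_perf

budget = 2000.0                       # hypothetical budget in dollars
cheap = throughput_for_budget(budget, node_cost=1000.0, node_perf=1.0)
fancy = throughput_for_budget(budget, node_cost=2000.0, node_perf=1.2)

print("cheap nodes:", cheap)          # (2, 2.0)
print("bleeding edge:", fancy)        # (1, 1.2)
print("advantage of cheap:", cheap[1] / fancy[1])
```

If the problem does not scale linearly, the second argument to the
comparison falls apart, which is why the benchmark on your own code has
to come first.
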
> I personally really like the opteron, and would commend it to people
> looking for a very good general purpose floating point engine.  I would
> mistrust vendor benchmarks that claim extreme speedups on vector
> operations for any big code running out of memory unless the MEMORY is
> somehow really special.  A Ferrari runs as fast as a Geo on a crowded
> city street.
> As always, your best benchmark is your own application, in all its dirty
> and possibly inefficiently coded state.  The vendor specs may show 30
> GFLOPS (for just the right code running out of L1 cache or out of
> on-chip registers) but when you hook that chip up to main memory with a
> 40 ns latency and some fixed bandwidth, it may slow right down to
> bandwidth limited rates indistinguishable from those of a much slower
> chip.
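
A back-of-the-envelope version of that last point: the sustained rate
cannot exceed either the peak rate or what the memory system can feed.
In the sketch below the 30 GFLOPS peak is from the text; the 6.4 GB/s
bandwidth and the 5 GFLOPS "slower chip" are made-up numbers, the
function name is hypothetical, and latency (the 40 ns) is not modeled
at all.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on sustained rate: compute-limited or
    bandwidth-limited, whichever is lower."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Triad a[i] = b[i] + s*c[i] on doubles: 2 flops per 24 bytes moved
# (two 8-byte reads and one 8-byte write, counted the way STREAM does).
TRIAD_INTENSITY = 2 / 24

BANDWIDTH = 6.4   # GB/s, ASSUMED for illustration

fast_chip = attainable_gflops(30.0, BANDWIDTH, TRIAD_INTENSITY)
slow_chip = attainable_gflops(5.0, BANDWIDTH, TRIAD_INTENSITY)

print(f"30 GFLOPS chip on triad: {fast_chip:.2f} GFLOPS")
print(f" 5 GFLOPS chip on triad: {slow_chip:.2f} GFLOPS")
```

Both chips come out at the same bandwidth-limited rate, well under
1 GFLOPS, on the same memory system, which is the Ferrari on the
crowded street.
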
> > > For me, I just revel in the Computer Age.  A decade ago, people were
> > > predicting all sorts of problems breaking the GHz barrier.  Today CPUs
> > > are routinely clocked at 3+ GHz, reaching for 4 and beyond.  A decade
> > 
> > I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
> > 10/100 + Intel PRO/1000 for $540 shipped. I was amazed.
> The Opterons tend to go for about twice that per CPU, but they are FAST,
> especially for their actual clock.  The AMD-64s can be picked up for
> about the same and they too are fast.  I haven't really done a complete
> benchmark run on the one I own so far, but they look intermediate
> between Opteron and everything else, at a much lower price.
>    rgb
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

Art Edwards
Senior Research Physicist
Air Force Research Laboratory
Electronics Foundations Branch
KAFB, New Mexico

(505) 853-6042 (v)
(505) 846-2290 (f)
