[Beowulf] Re: vectors vs. loops
Art Edwards edwardsa at afrl.kirtland.af.mil
Wed Apr 27 11:52:49 PDT 2005
This subject is pretty important to us. We run codes where the
bottleneck is eigensolving for matrices with a few thousand elements.
Parallel eigensolvers are not impressive at this scale. In the dark
past, I did a benchmark on a Cray Y-MP using a vector eigensolver and
got over 100x speedup. What I don't know is how this would compare to
current compilers and CPUs. However, the vector pipes are not very deep
on any of the current processors except, possibly, the PPC. So, I would
like to see benchmarks of electronic structure codes that are bound by
eigensolvers on a "true vector" machine.
Art Edwards
On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
> On Wed, 27 Apr 2005, Ben Mayer wrote:
>
> > > However, most code doesn't vectorize too well (even, as you say, with
> > > directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
> > > possible -- faster than a desktop, sure, but using a multimillion dollar
> > > machine to get a factor of MAYBE 10 in speedup compared to (at the time)
> > > $5-10K machines.
> >
> > What the people who run these centers have told me is that a
> > supercomputer is worth the cost if you can get a speedup of 30x over
> > serial. What do others think of this?
>
> I personally think that there is no global answer to this question.
> There is only cost-benefit analysis. It is trivially simple to reduce
> this assertion (by the people who run the centers, who are not exactly
> unbiased here:-) to absurdity for many, many cases. In either direction
> -- for some it might be worth it for a factor of 2 in speedup, for
> others it might NEVER be worth it at ANY speedup.
>
> For example, nearly all common and commercial software isn't worth it at
> any cost. If your word processor ran 30x faster, could you tell? Would
> you care? Would it be "worth" the considerable expense of rewriting it
> for a supercomputer architecture to get a speedup that you could never
> notice (presuming that one could actually speed it up)?
>
> Sure it's an obvious exception, but the problem with global answers is
> they brook no exceptions even when there are obvious ones. If you don't
> like the word processor example, pick a suitable rendered computer game
> (zero productive value, but all sorts of speedup opportunities). Pick any
> software with no particular VALUE in the return or with a low
> OPPORTUNITY COST of the runtime required to run it.
>
> A large number of HPC computations are in the latter category. If I want
> to run a simple simulation that takes eight hours on a serial machine
> and that I plan to run a single time, is it worth it for me to spend a
> month recoding it to run in parallel in five minutes? Obviously not.
> If you argue that I should include the porting time in the computation
> of "speedup" then I'd argue that if I have a program that takes two
> years to run without porting and that takes six months to port into a
> form that runs on a supercomputer in six months more, well, a year of MY
> life is worth it, depending on the actual COST of the "supercomputer"
> time compared to the serial computer time. Even in raw dollars, my
> salary for the extra year is nontrivial compared to the cost of
> purchasing and installing a brand-new cluster just to speed up the
> computation by a measly factor of two or four, depending on how you
> count.
>
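A rough sketch of that arithmetic in Python. The function names are hypothetical, and the figure of 160 working hours for "a month" of recoding is an assumption, not a number from the message above. It only compares wall-clock time; dollar costs and the value of your own time would have to be added on top.

```python
def break_even_runs(serial_time, parallel_time, porting_time):
    """Number of runs after which porting pays for itself in elapsed time."""
    saved_per_run = serial_time - parallel_time
    if saved_per_run <= 0:
        return float("inf")  # the port never pays off
    return porting_time / saved_per_run


def total_time(runs, run_time, porting_time=0.0):
    """Total elapsed time: one-off porting effort plus all the runs."""
    return porting_time + runs * run_time


# Case 1: eight-hour serial job, five minutes in parallel.
# ASSUMPTION: one month of recoding is about 160 working hours.
serial_h, parallel_h, port_h = 8.0, 5.0 / 60.0, 160.0
runs_needed = break_even_runs(serial_h, parallel_h, port_h)
print(f"porting breaks even after about {runs_needed:.1f} runs")
print(f"one run, serial: {total_time(1, serial_h):.1f} h")
print(f"one run, ported: {total_time(1, parallel_h, port_h):.1f} h")

# Case 2 (units are months): 24 months serial vs 6 to port plus 6 to run.
serial_months = total_time(1, 24.0)
ported_months = total_time(1, 6.0, porting_time=6.0)
print(f"long job: {serial_months:.0f} vs {ported_months:.0f} months")
```

For a job that is run exactly once, the first case never gets close to break-even, while the second saves a year even though the end-to-end speedup is only a factor of two.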
> So pay no attention to your supercomputer people's pronouncement. That
> number (or any other) is pulled out of, uh, their nether regions and is
> unjustifiable. Instead, do the cost-benefit analysis, problem by
> problem, using the best possible estimates you can come up with for the
> actual costs and benefits.
>
> That very few people EVER actually DO this does not mean that it isn't
> the way it should be done;-)
>
> > :) I needed to do some CHARMM runs this summer. The X1 did not like it
> > much (neither did I, but when the code is making references to punch
> > cards and you are trying to run it on a vector super, I think most
> > would feel that way). I ended up running it in parallel by a method
> > similar to yours. Worked great!
>
> The easy way into cluster (or nowadays, "grid") computing, for sure. If
> your task is or can be run embarrassingly parallel, well, parallel
> scaling doesn't generally get much better than a straight line of slope
> one barring the VERY few problems that exhibit superlinear scaling for
> some regime....;-)
>
> > > If it IS a vector (or nontrivial parallel, or both) task, then the
> > > problem almost by definition will EITHER require extensive "computer
> > > science" level study -- work done with Ian Foster's book, Almasi and
> > > Gottlieb for parallel and I don't know what for vector as it isn't my
> > > area of need or expertise and Amazon isn't terribly helpful (most books
> > > on vector processing deal with obsolete systems or are out of print, it
> > > seems).
> >
> > So what we should really be trying to do is matching code to the
> > machine. One of the problems that I have run into is that unless one
> > is at a large center there are only one or two machines that provide
> > computing power. Where I am from we have an X1 and a T3E. Not a very
> > good choice between the two. There should be a cluster coming up soon,
> > which will give us the options that we need, i.e., vector or cluster.
>
> No, what you SHOULD be doing is matching YOUR code to the cluster you
> design and build just for that code. With any luck, the cluster design
> will be a generic and inexpensive one that can be reused (possibly with
> minor reconfigurations) for a wide range of parallel problems. If your
> problem DOES trivially parallelize, nearly any grid/cluster of OTS
> computers capable of holding it in memory on (even) sneakernet will give
> you linear speedup.
>
> Given Cluster World's Really Cheap Cluster as an example, you could
> conceivably end up with a cluster design containing nodes that cost
> between $250 and $1000 each, including switches and network and shelving
> and everything, that can yield linear speedup on your code. Then you do
> your cost-benefit analysis, trade off your time, the value of the
> computation, the value of owning your own hardware and being able to run
> on it 24x7 without competition, the value of being able to redirect your
> hardware into other tasks when your main task is idle, any additional
> costs (power and AC, maybe some systems administration, maintenance).
> This will usually tell you fairly accurately both whether you should
> build your own local cluster vs run on a single desktop workstation vs
> run on a supercomputer at some center and will even tell you how many
> nodes you can/should buy and in what configuration to get the greatest
> net benefit.
>
> Note that this process is still correct for people who have code that
> WON'T run efficiently on really cheap node or network hardware; they
> just have to work harder. Either way, the most important work is
> prototyping and benchmarking. Know your hardware (possibilities) and
> know your application. Match up the two, paying attention to how much
> everything costs and using real world numbers everywhere you can. AVOID
> vendor provided numbers, and look upon published benchmark numbers for
> specific micro or macro benchmarks with deep suspicion unless you really
> understand the benchmark and trust the source. For example, you can
> trust anything >>I<< tell you, of course...;-)
>
> > The manual for the X1 provides some information and examples. Are the
> > Apple G{3,4,5} the only processors that have real vector units? I have
> > not really looked at SSE(2), but remember that they were not full
> > precision.
>
> What's a "real vector unit"? On chip? Off chip? Add-on board?
> Integrated with the memory and general purpose CPU (and hence
> bottlenecked) how?
>
> Nearly all CPUs have some degree of vectorization and parallelization
> available on chip these days; they just tend to hide a lot of it from
> you. Compilers work hard to get that benefit out for you in general
> purpose code, where you don't need to worry about whether or not the
> unit is "real", only about how long it takes the system to do a stream
> triad on a vector 10 MB long. Code portability is a "benefit" and code
> specialization is a "cost" when you work out the cost-benefit of making
> things run on a "real vector unit". I'd worry more about the times
> returned by e.g. stream with nothing fancy done to tune it than how
> "real" the underlying vector architecture is.
>
> Also, if your problem DOES trivially parallelize, remember that you have
> to compare the costs and benefits of complete solutions, in place. You
> really have to benchmark the computation, fully optimized for the
> architecture, on each possible architecture (including systems with
> "just" SSE but perhaps with 64 bit memory architectures and ATLAS for
> linear algebra that end up still being competitive) and then compare the
> COST of those systems to see which one ends up being cheaper. Remember
> that bleeding edge systems often charge you a factor of two or more in
> cost for a stinkin' 20% more performance, so that you're better off
> buying two cheap systems rather than one really expensive one IF your
> problem will scale linearly with number of nodes.
>
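The price/performance point can be put in numbers. This is a minimal sketch in normalized units (a cheap node has speed 1.0 and cost 1.0); the variable names are hypothetical, and the comparison holds only under the stated condition that the problem scales linearly with the number of nodes.

```python
# ASSUMPTION: normalized units, cheap node = speed 1.0 at cost 1.0.
CHEAP_SPEED, CHEAP_COST = 1.0, 1.0

# Bleeding-edge node: 20% more performance for a factor of two in cost.
fast_speed, fast_cost = 1.2, 2.0

# Spend the same money on cheap nodes instead (linear scaling assumed).
cheap_nodes = int(fast_cost / CHEAP_COST)
cluster_speed = cheap_nodes * CHEAP_SPEED
cluster_cost = cheap_nodes * CHEAP_COST

advantage = cluster_speed / fast_speed
print(f"{cheap_nodes} cheap nodes: speed {cluster_speed:.1f} at cost {cluster_cost:.1f}")
print(f"1 fast node:   speed {fast_speed:.1f} at cost {fast_cost:.1f}")
print(f"cheap nodes are {advantage:.2f}x faster for the same money")
```

If the code does not scale linearly, cluster_speed has to come from a measured benchmark rather than from multiplying by the node count.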
> I personally really like the Opteron, and would commend it to people
> looking for a very good general purpose floating point engine. I would
> mistrust vendor benchmarks that claim extreme speedups on vector
> operations for any big code running out of memory unless the MEMORY is
> somehow really special. A Ferrari runs as fast as a Geo on a crowded
> city street.
>
> As always, your best benchmark is your own application, in all its dirty
> and possibly inefficiently coded state. The vendor specs may show 30
> GFLOPS (for just the right code running out of L1 cache or out of
> on-chip registers) but when you hook that chip up to main memory with a
> 40 ns latency and some fixed bandwidth, it may slow right down to
> bandwidth limited rates indistinguishable from those of a much slower
> chip.
>
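A back-of-the-envelope sketch of that effect for a stream triad, a[i] = b[i] + s * c[i], on doubles. Each iteration does 2 floating point operations and moves 24 bytes (read b, read c, write a). The 6.4 GB/s memory bandwidth and the 4 GFLOPS "slower chip" are hypothetical numbers chosen for illustration; the estimate ignores latency and write-allocate traffic, so treat it as an upper bound, not a prediction.

```python
def triad_gflops_bound(peak_gflops, bandwidth_gb_s, bytes_per_elem=8):
    """Upper bound on triad rate: the lesser of peak and memory-fed rate."""
    flops_per_iter = 2                   # one multiply, one add
    bytes_per_iter = 3 * bytes_per_elem  # read b, read c, write a
    memory_bound = bandwidth_gb_s * flops_per_iter / bytes_per_iter
    return min(peak_gflops, memory_bound)


# ASSUMPTION: both chips sit on the same 6.4 GB/s memory system.
bandwidth = 6.4
fast_chip = triad_gflops_bound(peak_gflops=30.0, bandwidth_gb_s=bandwidth)
slow_chip = triad_gflops_bound(peak_gflops=4.0, bandwidth_gb_s=bandwidth)

print(f"30 GFLOPS chip, out of main memory: {fast_chip:.2f} GFLOPS")
print(f" 4 GFLOPS chip, out of main memory: {slow_chip:.2f} GFLOPS")
```

Both chips end up at the same bandwidth-limited rate, which is why a measured stream number on your own hardware is more informative than the peak figure on the spec sheet.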
> > > For me, I just revel in the Computer Age. A decade ago, people were
> > > predicting all sorts of problems breaking the GHz barrier. Today CPUs
> > > are routinely clocked at 3+ GHz, reaching for 4 and beyond. A decade
> >
> > I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
> > 10/100 + Intel 1000 Pro for $540 shipped. I was amazed.
>
> The Opterons tend to go for about twice that per CPU, but they are FAST,
> especially for their actual clock. The AMD-64s can be picked up for
> about the same and they too are fast. I haven't really done a complete
> benchmark run on the one I own so far, but they look intermediate
> between Opteron and everything else, at a much lower price.
>
> rgb
>
> --
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
--
Art Edwards
Senior Research Physicist
Air Force Research Laboratory
Electronics Foundations Branch
KAFB, New Mexico
(505) 853-6042 (v)
(505) 846-2290 (f)
