[Beowulf] Re: vectors vs. loops

Wed Apr 27 12:51:49 PDT 2005

Hi Art:

   Any particular codes you have in mind?  I used to play around with 
lots of DFT (LDA) codes.  Back then, large systems were 256 x 256, with 
periodic BC's.  We used a number of eigensolvers, and eventually settled 
on LAPACK's zheev.  Modeling supercells much larger than 64 atoms 
with 4 electronic basis states was a challenge using that code.

   Do you have a particular model system in mind as well?  A nice GAMESS 
model (or similar) might work out nicely.  I would like to include some 
electronic structure codes in our (evolving) BBS system.

Joe

Art Edwards wrote:
> This subject is pretty important to us. We run codes where the
> bottleneck is eigensolving for matrices with a few thousand elements.
> Parallel eigensolvers are not impressive at this scale. In the dark
> past, I did a benchmark on a Cray Y-MP using a vector eigensolver and
> got over 100x speedup. What I don't know is how this would compare to
> current compilers and CPUs. However, the vector pipes are not very deep
> on any of the current processors except, possibly, the PPC. So, I would
> like to see benchmarks of electronic structure codes that are bound by
> eigensolvers on a "true vector" machine. 
> 
> Art Edwards
> 
> On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
> 
>>On Wed, 27 Apr 2005, Ben Mayer wrote:
>>
>>
>>>>However, most code doesn't vectorize too well (even, as you say, with
>>>>directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
>>>>possible -- faster than a desktop, sure, but using a multimillion dollar
>>>>machine to get a factor of MAYBE 10 in speedup compared to (at the time)
>>>>$5-10K machines.
>>>
>>>The people who run these centers have told me that a
>>>supercomputer is worth the cost if you can get a speedup of 30x over
>>>serial. What do others think of this?
>>
>>I personally think that there is no global answer to this question.
>>There is only cost-benefit analysis.  It is trivially simple to reduce
>>this assertion (by the people who run the centers, who are not exactly
>>unbiased here:-) to absurdity for many, many cases.  In either direction
>>-- for some it might be worth it for a factor of 2 in speedup, for
>>others it might NEVER be worth it at ANY speedup.
>>
>>For example, nearly all common and commercial software isn't worth it at
>>any cost.  If your word processor ran 30x faster, could you tell?  Would
>>you care?  Would it be "worth" the considerable expense of rewriting it
>>for a supercomputer architecture to get a speedup that you could never
>>notice (presuming that one could actually speed it up)?  
>>
>>Sure it's an obvious exception, but the problem with global answers is
>>they brook no exceptions even when there are obvious ones.  If you don't
>>like word processor, pick a suitable rendered computer game (zero
>>productive value, but all sorts of speedup opportunities).  Pick any
>>software with no particular VALUE in the return or with a low
>>OPPORTUNITY COST of the runtime required to run it.
>>
>>A large number of HPC computations are in the latter category.  If I want
>>to run a simple simulation that takes eight hours on a serial machine
>>and that I plan to run a single time, is it worth it for me to spend a
>>month recoding it to run in parallel in five minutes?  Obviously not.
>>If you argue that I should include the porting time in the computation
>>of "speedup" then I'd argue that if I have a program that takes two
>>years to run without porting and that takes six months to port into a
>>form that runs on a supercomputer in six months more, well, a year of MY
>>life is worth it, depending on the actual COST of the "supercomputer"
>>time compared to the serial computer time.  Even in raw dollars, my
>>salary for the extra year is nontrivial compared to the cost of
>>purchasing and installing a brand-new cluster just to speed up the
>>computation by a measly factor of two or four, depending on how you
>>count.
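
The arithmetic in the two examples above can be sketched as follows. This
is only an illustration, not anything from the original posts: the
function names are hypothetical, and a 30-day month of wall-clock time is
an assumption made to turn "a month" into hours.

```python
import math

# Assumption: a month is 30 days of wall-clock time.
HOURS_PER_MONTH = 30 * 24


def time_to_solution(port_hours, run_hours, n_runs=1):
    """Porting is paid once; the run time is paid on every run."""
    return port_hours + n_runs * run_hours


def effective_speedup(serial_run_hours, port_hours, ported_run_hours, n_runs=1):
    """Speedup in total time to a result, counting the porting effort."""
    serial_total = time_to_solution(0.0, serial_run_hours, n_runs)
    ported_total = time_to_solution(port_hours, ported_run_hours, n_runs)
    return serial_total / ported_total


def break_even_runs(serial_run_hours, port_hours, ported_run_hours):
    """Smallest number of runs after which porting has paid for itself."""
    saved_per_run = serial_run_hours - ported_run_hours
    return math.ceil(port_hours / saved_per_run)


# Case 1: an 8-hour job run once vs. a month of recoding for a 5-minute run.
one_shot = effective_speedup(8.0, HOURS_PER_MONTH, 5.0 / 60.0)
one_shot_break_even = break_even_runs(8.0, HOURS_PER_MONTH, 5.0 / 60.0)

# Case 2: a 2-year job vs. 6 months of porting plus a 6-month run.
two_years = 24 * HOURS_PER_MONTH
six_months = 6 * HOURS_PER_MONTH
long_job = effective_speedup(two_years, six_months, six_months)
long_job_raw = two_years / six_months  # porting time ignored

print(f"8-hour job, run once: effective speedup {one_shot:.4f}")
print(f"8-hour job: porting pays off after {one_shot_break_even} runs")
print(f"2-year job: effective speedup {long_job:.1f}, raw {long_job_raw:.1f}")
```

Under these assumptions the one-shot job is far slower to a result once
porting is counted, and the two-year job comes out at a factor of two or
four depending on whether the porting time is included.
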
>>
>>So pay no attention to your supercomputer people's pronouncement.  That
>>number (or any other) is pulled out of, uh, their nether regions and is
>>unjustifiable.  Instead, do the cost-benefit analysis, problem by
>>problem, using the best possible estimates you can come up with for the
>>actual costs and benefits.
>>
>>That very few people EVER actually DO this does not mean that it isn't
>>the way it should be done;-)
>>
>>
>>>:) I needed to do some CHARMM runs this summer. The X1 did not like it
>>>much (neither did I, but when the code is making references to punch
>>>cards and you are trying to run it on a vector super, I think most
>>>would feel that way), I ended up running it in parallel by a similar
>>>method as yours. Worked great!
>>
>>The easy way into cluster (or nowadays, "grid") computing, for sure.  If
>>your task is or can be run embarrassingly parallel, well, parallel
>>scaling doesn't generally get much better than a straight line of slope
>>one barring the VERY few problems that exhibit superlinear scaling for
>>some regime....;-)
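
One way to picture the "straight line of slope one" is Amdahl's law. It
is not mentioned in the thread and is brought in here only as an
illustration. The 95% parallel fraction in the second column is an
assumed value, not a measurement of any code discussed here.

```python
def amdahl_speedup(parallel_fraction, n_nodes):
    """Amdahl's law: the serial part of the work does not shrink with nodes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_nodes)


# An embarrassingly parallel task has parallel_fraction == 1.0, so the
# speedup equals the node count (a line of slope one).
# 0.95 is an assumed fraction for a code with a small serial part.
for n in (1, 2, 4, 8, 16, 64):
    ideal = amdahl_speedup(1.0, n)
    partly_serial = amdahl_speedup(0.95, n)
    print(f"{n:3d} nodes: fully parallel {ideal:5.1f}x, 95% parallel {partly_serial:5.2f}x")
```

With a parallel fraction of 1.0 the speedup tracks the node count
exactly. With 0.95 it can never exceed 20x, however many nodes are
added.
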
>>
>>
>>>>If it IS a vector (or nontrivial parallel, or both) task, then the
>>>>problem almost by definition will EITHER require extensive "computer
>>>>science" level study -- work done with Ian Foster's book, Almasi and
>>>>Gottlieb for parallel and I don't know what for vector as it isn't my
>>>>area of need or expertise and Amazon isn't terribly helpful (most books
>>>>on vector processing deal with obsolete systems or are out of print, it
>>>>seems).
>>>
>>>So what we should really be trying to do is matching code to the
>>>machine. One of the problems that I have run into is that unless one
>>>is at a large center there are only one or two machines that provide
>>>computing power. Where I am from we have an X1 and a T3E. Not a very good
>>>choice between the two. There should be a cluster coming up soon,
>>>which will give us the options that we need, i.e. vector or cluster.
>>
>>No, what you SHOULD be doing is matching YOUR code to the cluster you
>>design and build just for that code.  With any luck, the cluster design
>>will be a generic and inexpensive one that can be reused (possibly with
>>minor reconfigurations) for a wide range of parallel problems.  If your
>>problem DOES trivially parallelize, nearly any grid/cluster of OTS
>>computers capable of holding it in memory on (even) sneakernet will give
>>you linear speedup.  
>>
>>Given Cluster World's Really Cheap Cluster as an example, you could
>>conceivably end up with a cluster design containing nodes that cost
>>between $250 and $1000 each, including switches and network and shelving
>>and everything, that can yield linear speedup on your code.  Then you do
>>your cost-benefit analysis, trade off your time, the value of the
>>computation, the value of owning your own hardware and being able to run
>>on it 24x7 without competition, the value of being able to redirect your
>>hardware into other tasks when your main task is idle, any additional
>>costs (power and AC, maybe some systems administration, maintenance).
>>This will usually tell you fairly accurately both whether you should
>>build your own local cluster vs run on a single desktop workstation vs
>>run on a supercomputer at some center and will even tell you how many
>>nodes you can/should buy and in what configuration to get the greatest
>>net benefit.
>>
>>Note that this process is still correct for people who have code that
>>WON'T run efficiently on really cheap node or network hardware; they
>>just have to work harder.  Either way, the most important work is
>>prototyping and benchmarking.  Know your hardware (possibilities) and
>>know your application.  Match up the two, paying attention to how much
>>everything costs and using real world numbers everywhere you can.  AVOID
>>vendor provided numbers, and look upon published benchmark numbers for
>>specific micro or macro benchmarks with deep suspicion unless you really
>>understand the benchmark and trust the source.  For example, you can
>>trust anything >>I<< tell you, of course...;-)
>>
>>
>>>The manual for the X1 provides some information and examples. Are the
>>>Apple G{3,4,5} the only processors that have real vector units? I have
>>>not really looked at SSE(2), but remember that they were not full
>>>precision.
>>
>>What's a "real vector unit"?  On chip?  Off chip?  Add-on board?
>>Integrated with the memory and general purpose CPU (and hence
>>bottlenecked) how?
>>
>>Nearly all CPUs have some degree of vectorization and parallelization
>>available on chip these days; they just tend to hide a lot of it from
>>you.  Compilers work hard to get that benefit out for you in general
>>purpose code, where you don't need to worry about whether or not the
>>unit is "real", only about how long it takes the system to do a stream
>>triad on a vector 10 MB long.  Code portability is a "benefit" and code
>>specialization is a "cost" when you work out the cost-benefit of making
>>things run on a "real vector unit".  I'd worry more about the times
>>returned by e.g. stream with nothing fancy done to tune it than how
>>"real" the underlying vector architecture is.
>>
>>Also, if your problem DOES trivially parallelize, remember that you have
>>to compare the costs and benefits of complete solutions, in place.  You
>>really have to benchmark the computation, fully optimized for the
>>architecture, on each possible architecture (including systems with
>>"just" SSE but perhaps with 64 bit memory architectures and ATLAS for
>>linear algebra that end up still being competitive) and then compare the
>>COST of those systems to see which one ends up being cheaper.  Remember
>>that bleeding edge systems often charge you a factor of two or more in
>>cost for a stinkin' 20% more performance, so that you're better off
>>buying two cheap systems rather than one really expensive one IF your
>>problem will scale linearly with number of nodes.
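
A rough sketch of that comparison, using only the "factor of two in cost
for 20% more performance" figures from the paragraph above. Costs and
performance are normalized to one cheap system. The scaling-efficiency
parameter and all names are hypothetical additions for illustration.

```python
# Normalized to one cheap system: cost 1.0, performance 1.0.
CHEAP_COST, CHEAP_PERF = 1.0, 1.0

# Bleeding edge system: twice the cost for 20% more performance.
EDGE_COST, EDGE_PERF = 2.0, 1.2


def cluster_perf(n_nodes, node_perf, scaling_efficiency=1.0):
    """Aggregate performance; efficiency 1.0 means perfectly linear scaling."""
    return n_nodes * node_perf * scaling_efficiency


def break_even_efficiency(n_nodes, node_perf, target_perf):
    """Scaling efficiency at which n cheap nodes only match the target."""
    return target_perf / (n_nodes * node_perf)


two_cheap_perf = cluster_perf(2, CHEAP_PERF)  # linear scaling assumed
two_cheap_cost = 2 * CHEAP_COST
needed = break_even_efficiency(2, CHEAP_PERF, EDGE_PERF)

print(f"one bleeding edge system: perf {EDGE_PERF:.1f} for cost {EDGE_COST:.1f}")
print(f"two cheap systems:        perf {two_cheap_perf:.1f} for cost {two_cheap_cost:.1f}")
print(f"two cheap systems still win above {needed:.0%} scaling efficiency")
```

For the same money, two cheap systems deliver 2.0 against 1.2 if scaling
is linear, and they stay ahead as long as the pair scales at better than
60% efficiency.
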
>>
>>I personally really like the opteron, and would commend it to people
>>looking for a very good general purpose floating point engine.  I would
>>mistrust vendor benchmarks that claim extreme speedups on vector
>>operations for any big code running out of memory unless the MEMORY is
>>somehow really special.  A Ferrari runs as fast as a Geo on a crowded
>>city street.
>>
>>As always, your best benchmark is your own application, in all its dirty
>>and possibly inefficiently coded state.  The vendor specs may show 30
>>GFLOPS (for just the right code running out of L1 cache or out of
>>on-chip registers) but when you hook that chip up to main memory with a
>>40 ns latency and some fixed bandwidth, it may slow right down to
>>bandwidth limited rates indistinguishable from those of a much slower
>>chip.
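
The slowdown described above can be estimated with a simple roofline-style
bound. This is a model brought in for illustration, not something from
the thread. The 30 GFLOPS peak is the figure quoted above. The 6.4 GB/s
memory bandwidth and the 3 GFLOPS "slower chip" are assumed values.
The kernel is the stream triad a[i] = b[i] + q*c[i] on 8-byte doubles,
counted as 2 flops per 24 bytes moved (write-allocate traffic ignored).

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Stream triad on doubles: one multiply and one add per element,
# two 8-byte loads and one 8-byte store per element.
TRIAD_FLOPS_PER_BYTE = 2.0 / 24.0

MEMORY_BANDWIDTH_GB_S = 6.4  # assumption, not a measured number

fast_chip = attainable_gflops(30.0, MEMORY_BANDWIDTH_GB_S, TRIAD_FLOPS_PER_BYTE)
slow_chip = attainable_gflops(3.0, MEMORY_BANDWIDTH_GB_S, TRIAD_FLOPS_PER_BYTE)

print(f"30 GFLOPS chip, triad from main memory: {fast_chip:.2f} GFLOPS")
print(f" 3 GFLOPS chip, triad from main memory: {slow_chip:.2f} GFLOPS")
```

Under these assumptions both chips are held to about 0.53 GFLOPS on the
triad, which is the Ferrari-and-Geo situation: the memory system sets
the speed, not the peak rating.
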
>>
>>
>>>>For me, I just revel in the Computer Age.  A decade ago, people were
>>>>predicting all sorts of problems breaking the GHz barrier.  Today CPUs
>>>>are routinely clocked at 3+ GHz, reaching for 4 and beyond.  A decade
>>>
>>>I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
>>>10/100 + intel 1000 Pro for $540 shipped. I was amazed.
>>
>>The Opterons tend to go for about twice that per CPU, but they are FAST,
>>especially for their actual clock.  The AMD-64's can be picked up for
>>about the same and they too are fast.  I haven't really done a complete
>>benchmark run on the one I own so far, but they look intermediate
>>between Opteron and everything else, at a much lower price.
>>
>>   rgb
>>
>>-- 
>>Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
>>Duke University Dept. of Physics, Box 90305
>>Durham, N.C. 27708-0305
>>Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
>>
>>
> 
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615