[Beowulf] Re: vectors vs. loops
Mikhail Kuzminsky
kus at free.net
Thu Apr 28 07:46:32 PDT 2005
In message from Joe Landman <landman at scalableinformatics.com> (Wed, 27
Apr 2005 15:51:49 -0400):
>Hi Art:
>
> Any particular codes you have in mind? I used to play around with
>lots of DFT (LDA) codes. Back then, large systems were 256 x 256, with
>periodic BCs.
Most (practically all) DFT codes are not limited by the eigenvalue
problem. The limiting stage is the computation of the two-electron
integrals and of the Fock matrix ("fockian").
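
A back-of-the-envelope sketch of why that is (my own illustration, with
arbitrary constants; only the scaling matters): the two-electron
integral / Fock-build step is formally O(N^4) in the number of basis
functions before integral screening, while the diagonalization is only
O(N^3).

# Rough cost model for one SCF/DFT iteration (illustrative only).
# N is the number of basis functions; the constants are arbitrary.
def scf_step_costs(n_basis: int) -> dict:
    # Two-electron integrals / Fock build: formally O(N^4) before screening.
    fock_build = n_basis ** 4
    # Diagonalization of the Fock matrix: O(N^3).
    diagonalization = n_basis ** 3
    return {"fock_build": fock_build, "diagonalization": diagonalization}

if __name__ == "__main__":
    for n in (100, 500, 2000):
        c = scf_step_costs(n)
        ratio = c["fock_build"] / c["diagonalization"]
        print(f"N={n:5d}  Fock/diag cost ratio ~ {ratio:.0f}")
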
Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow
> We used a number of eigensolvers, and eventually settled on LAPACK's
>zheev. Modeling supercells much larger than 64 atoms with 4 electronic
>basis states was a challenge using that code.
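
(For reference, and not part of the original exchange: the zheev family
Joe mentions is what numpy.linalg.eigh dispatches to for complex
Hermitian input in most builds, so a minimal sketch of that kind of
dense diagonalization looks like the following; the 256x256 size
matches a 64-atom, 4-basis-state supercell.)

# Build a random complex Hermitian "Hamiltonian" and diagonalize it.
# numpy.linalg.eigh calls into LAPACK's Hermitian eigensolvers (the
# zheev/zheevd family for complex input).
import numpy as np

rng = np.random.default_rng(0)
n = 256                                # 64 atoms x 4 basis states

a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
h = (a + a.conj().T) / 2               # Hermitize

eigvals, eigvecs = np.linalg.eigh(h)   # ascending real eigenvalues
print(eigvals[:5])
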
>
> Do you have a particular model system in mind as well? A nice GAMESS
>model (or similar) might work out nicely. I would like to include some
>electronic structure codes in our (evolving) BBS system.
>
>Joe
>
>Art Edwards wrote:
>> This subject is pretty important to us. We run codes where the
>> bottleneck is eigensolving for matrices with a few thousand elements.
>> Parallel eigensolvers are not impressive at this scale. In the dark
>> past, I did a benchmark on a Cray Y-MP using a vector eigensolver and
>> got over 100x speedup. What I don't know is how this would compare to
>> current compilers and CPUs. However, the vector pipes are not very
>> deep on any of the current processors except, possibly, the PPC. So,
>> I would like to see benchmarks of electronic structure codes that are
>> bound by eigensolvers on a "true vector" machine.
>>
>> Art Edwards
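
(A rough timing harness for exactly this kind of comparison -- my own
sketch, not an established benchmark: time a dense complex Hermitian
eigensolve at the few-thousand scale Art describes, on whatever
LAPACK/BLAS the machine under test provides, and compare the wall-clock
numbers across machines.)

import time
import numpy as np

def time_eigh(n: int, repeats: int = 3) -> float:
    """Best-of-N wall-clock time for one dense Hermitian eigensolve."""
    rng = np.random.default_rng(42)
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    h = (a + a.conj().T) / 2
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.linalg.eigh(h)
        best = min(best, time.perf_counter() - t0)
    return best

if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(f"n={n:5d}  best of 3: {time_eigh(n):7.2f} s")
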
>>
>> On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
>>
>>>On Wed, 27 Apr 2005, Ben Mayer wrote:
>>>
>>>
>>>>>However, most code doesn't vectorize too well (even, as you say,
>>>>>with directives), so people would end up getting 25 MFLOPs out of
>>>>>300 MFLOPs possible -- faster than a desktop, sure, but using a
>>>>>multimillion dollar machine to get a factor of MAYBE 10 in speedup
>>>>>compared to (at the time) $5-10K machines.
>>>>
>>>>What the people who run these centers have told me is that a
>>>>supercomputer is worth the cost if you can get a speedup of 30x over
>>>>serial. What do others think of this?
>>>
>>>I personally think that there is no global answer to this question.
>>>There is only cost-benefit analysis. It is trivially simple to reduce
>>>this assertion (by the people who run the centers, who are not exactly
>>>unbiased here:-) to absurdity for many, many cases. In either
>>>direction -- for some it might be worth it for a factor of 2 in
>>>speedup, for others it might NEVER be worth it at ANY speedup.
>>>
>>>For example, nearly all common and commercial software isn't worth it
>>>at any cost. If your word processor ran 30x faster, could you tell?
>>>Would you care? Would it be "worth" the considerable expense of
>>>rewriting it for a supercomputer architecture to get a speedup that
>>>you could never notice (presuming that one could actually speed it
>>>up)?
>>>
>>>Sure it's an obvious exception, but the problem with global answers is
>>>they brook no exceptions even when there are obvious ones. If you
>>>don't like the word processor example, pick a suitably rendered
>>>computer game (zero productive value, but all sorts of speedup
>>>opportunities). Pick any software with no particular VALUE in the
>>>return or with a low OPPORTUNITY COST of the runtime required to run
>>>it.
>>>
>>>A large number of HPC computations are in the latter category. If I
>>>want to run a simple simulation that takes eight hours on a serial
>>>machine and that I plan to run a single time, is it worth it for me to
>>>spend a month recoding it to run in parallel in five minutes?
>>>Obviously not. If you argue that I should include the porting time in
>>>the computation of "speedup", then I'd argue that if I have a program
>>>that takes two years to run without porting and that takes six months
>>>to port into a form that runs on a supercomputer in six months more,
>>>well, a year of MY life is worth it, depending on the actual COST of
>>>the "supercomputer" time compared to the serial computer time. Even
>>>in raw dollars, my salary for the extra year is nontrivial compared to
>>>the cost of purchasing and installing a brand-new cluster just to
>>>speed up the computation by a measly factor of two or four, depending
>>>on how you count.
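
(To make rgb's arithmetic concrete, a toy calculator for that trade-off;
all the dollar figures below are invented, purely for illustration.)

def port_payoff(serial_months: float, port_months: float, speedup: float,
                researcher_month: float, machine_cost: float):
    """Wall-clock months saved by porting, and a rough dollar net: the
    value of the researcher-time saved minus the cost of the machine the
    ported code runs on.  Porting time counts against the savings,
    because the researcher is busy porting instead of doing science."""
    parallel_months = serial_months / speedup
    months_saved = serial_months - (port_months + parallel_months)
    net_dollars = months_saved * researcher_month - machine_cost
    return months_saved, net_dollars

# First scenario above: an 8-hour run, a month of porting, ~100x speedup.
print(port_payoff(serial_months=8 / 720, port_months=1, speedup=100,
                  researcher_month=8000, machine_cost=30000))
# Second scenario: a two-year run, six months to port, 4x speedup.
print(port_payoff(serial_months=24, port_months=6, speedup=4,
                  researcher_month=8000, machine_cost=30000))
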
>>>
>>>So pay no attention to your supercomputer people's pronouncement.
>>>That number (or any other) is pulled out of, uh, their nether regions
>>>and is unjustifiable. Instead, do the cost-benefit analysis, problem
>>>by problem, using the best possible estimates you can come up with for
>>>the actual costs and benefits.
>>>
>>>That very few people EVER actually DO this does not mean that it isn't
>>>the way it should be done;-)
>>>
>>>
>>>>:) I needed to do some CHARMM runs this summer. The X1 did not like
>>>>it much (neither did I, but when the code is making references to
>>>>punch cards and you are trying to run it on a vector super, I think
>>>>most would feel that way). I ended up running it in parallel by a
>>>>method similar to yours. Worked great!
>>>
>>>The easy way into cluster (or nowadays, "grid") computing, for sure.
>>>If your task is or can be run embarrassingly parallel, well, parallel
>>>scaling doesn't generally get much better than a straight line of
>>>slope one, barring the VERY few problems that exhibit superlinear
>>>scaling for some regime....;-)
>>>
>>>
>>>>>If it IS a vector (or nontrivial parallel, or both) task, then the
>>>>>problem almost by definition will EITHER require extensive "computer
>>>>>science" level study -- work done with Ian Foster's book, Almasi and
>>>>>Gottlieb for parallel, and I don't know what for vector, as it isn't
>>>>>my area of need or expertise and Amazon isn't terribly helpful (most
>>>>>books on vector processing deal with obsolete systems or are out of
>>>>>print, it seems).
>>>>
>>>>So what we should really be trying to do is matching code to the
>>>>machine. One of the problems that I have run into is that unless one
>>>>is at a large center there are only one or two machines that provide
>>>>computing power. Where I am from we have an X1 and a T3E. Not a very
>>>>good choice between the two. There should be a cluster coming up
>>>>soon, which will give us the options that we need, i.e. vector or
>>>>cluster.
>>>
>>>No, what you SHOULD be doing is matching YOUR code to the cluster you
>>>design and build just for that code. With any luck, the cluster
>>>design will be a generic and inexpensive one that can be reused
>>>(possibly with minor reconfigurations) for a wide range of parallel
>>>problems. If your problem DOES trivially parallelize, nearly any
>>>grid/cluster of OTS computers capable of holding it in memory on
>>>(even) sneakernet will give you linear speedup.
>>>
>>>Given Cluster World's Really Cheap Cluster as an example, you could
>>>conceivably end up with a cluster design containing nodes that cost
>>>between $250 and $1000 each, including switches and network and
>>>shelving and everything, that can yield linear speedup on your code.
>>>Then you do your cost-benefit analysis, trade off your time, the value
>>>of the computation, the value of owning your own hardware and being
>>>able to run on it 24x7 without competition, the value of being able to
>>>redirect your hardware into other tasks when your main task is idle,
>>>any additional costs (power and AC, maybe some systems administration,
>>>maintenance). This will usually tell you fairly accurately both
>>>whether you should build your own local cluster vs run on a single
>>>desktop workstation vs run on a supercomputer at some center, and will
>>>even tell you how many nodes you can/should buy and in what
>>>configuration to get the greatest net benefit.
>>>
>>>Note that this process is still correct for people who have code that
>>>WON'T run efficiently on really cheap node or network hardware; they
>>>just have to work harder. Either way, the most important work is
>>>prototyping and benchmarking. Know your hardware (possibilities) and
>>>know your application. Match up the two, paying attention to how much
>>>everything costs and using real world numbers everywhere you can.
>>>AVOID vendor provided numbers, and look upon published benchmark
>>>numbers for specific micro or macro benchmarks with deep suspicion
>>>unless you really understand the benchmark and trust the source. For
>>>example, you can trust anything >>I<< tell you, of course...;-)
>>>
>>>
>>>>The manual for the X1 provides some information and examples. Are the
>>>>Apple G{3,4,5} the only processors that have real vector units? I
>>>>have not really looked at SSE(2), but I remember that they were not
>>>>full precision.
>>>
>>>What's a "real vector unit"? On chip? Off chip? Add-on board?
>>>Integrated with the memory and general purpose CPU (and hence
>>>bottlenecked) how?
>>>
>>>Nearly all CPUs have some degree of vectorization and parallelization
>>>available on chip these days; they just tend to hide a lot of it from
>>>you. Compilers work hard to get that benefit out for you in general
>>>purpose code, where you don't need to worry about whether or not the
>>>unit is "real", only about how long it takes the system to do a stream
>>>triad on a vector 10 MB long. Code portability is a "benefit" and
>>>code specialization is a "cost" when you work out the cost-benefit of
>>>making things run on a "real vector unit". I'd worry more about the
>>>times returned by e.g. stream with nothing fancy done to tune it than
>>>about how "real" the underlying vector architecture is.
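
(For anyone who has not run it: the "stream triad" is the
a(i) = b(i) + q*c(i) kernel from McCalpin's STREAM benchmark. The real
benchmark is a C/Fortran program; the numpy stand-in below carries some
interpreter overhead but is memory-bandwidth bound in much the same
way.)

import time
import numpy as np

n = 1_250_000                    # 1.25M doubles, ~10 MB per array
b = np.random.rand(n)
c = np.random.rand(n)
a = np.empty_like(b)
q = 3.0

best = float("inf")
for _ in range(10):
    t0 = time.perf_counter()
    np.add(b, q * c, out=a)      # triad: a[i] = b[i] + q*c[i]
    best = min(best, time.perf_counter() - t0)

# STREAM counts 24 bytes/element (read b, read c, write a); the numpy
# temporary from q*c moves some extra data, so treat this as a rough
# lower bound on the achievable bandwidth.
bytes_moved = 3 * n * 8
print(f"triad best: {best * 1e3:.2f} ms, "
      f"~{bytes_moved / best / 1e9:.1f} GB/s effective")
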
>>>
>>>Also, if your problem DOES trivially parallelize, remember that you
>>>have to compare the costs and benefits of complete solutions, in
>>>place. You really have to benchmark the computation, fully optimized
>>>for the architecture, on each possible architecture (including systems
>>>with "just" SSE but perhaps with 64 bit memory architectures and ATLAS
>>>for linear algebra that end up still being competitive) and then
>>>compare the COST of those systems to see which one ends up being
>>>cheaper. Remember that bleeding edge systems often charge you a
>>>factor of two or more in cost for a stinkin' 20% more performance, so
>>>that you're better off buying two cheap systems rather than one really
>>>expensive one IF your problem will scale linearly with number of
>>>nodes.
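
(The arithmetic behind that last point, as a toy sketch -- node prices
and speeds invented, linear scaling assumed.)

budget = 2000.0
cheap_node = {"cost": 1000.0, "speed": 1.0}   # arbitrary speed units
fast_node  = {"cost": 2000.0, "speed": 1.2}   # 20% faster, twice the price

def throughput(node: dict, budget: float) -> float:
    """Aggregate throughput from spending the budget on one node type,
    assuming the workload scales linearly across nodes."""
    return (budget // node["cost"]) * node["speed"]

print("cheap nodes:       ", throughput(cheap_node, budget))  # -> 2.0
print("bleeding-edge node:", throughput(fast_node, budget))   # -> 1.2
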
>>>
>>>I personally really like the Opteron, and would commend it to people
>>>looking for a very good general purpose floating point engine. I
>>>would mistrust vendor benchmarks that claim extreme speedups on vector
>>>operations for any big code running out of main memory unless the
>>>MEMORY is somehow really special. A Ferrari runs no faster than a Geo
>>>on a crowded city street.
>>>
>>>As always, your best benchmark is your own application, in all its
>>>dirty and possibly inefficiently coded state. The vendor specs may
>>>show 30 GFLOPS (for just the right code running out of L1 cache or out
>>>of on-chip registers), but when you hook that chip up to main memory
>>>with a 40 ns latency and some fixed bandwidth, it may slow right down
>>>to bandwidth-limited rates indistinguishable from those of a much
>>>slower chip.
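
(A quick back-of-the-envelope version of that slowdown, with assumed
numbers: a triad-like kernel needs roughly 24 bytes of memory traffic
per multiply-add pair, so the sustained rate is capped by bandwidth no
matter what the peak is.)

peak_gflops = 30.0        # vendor figure for in-cache/in-register code
bandwidth_gbs = 6.4       # assumed sustained main-memory bandwidth
bytes_per_flop = 24 / 2   # read b, read c, write a per multiply + add
sustained = bandwidth_gbs / bytes_per_flop
print(f"bandwidth-limited rate: {sustained:.2f} GFLOP/s "
      f"(vs {peak_gflops:.0f} GFLOP/s peak)")
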
>>>
>>>
>>>>>For me, I just revel in the Computer Age. A decade ago, people were
>>>>>predicting all sorts of problems breaking the GHz barrier. Today
>>>>>CPUs are routinely clocked at 3+ GHz, reaching for 4 and beyond. A
>>>>>decade
>>>>
>>>>I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
>>>>10/100 + Intel 1000 Pro for $540 shipped. I was amazed.
>>>
>>>The Opterons tend to go for about twice that per CPU, but they are
>>>FAST, especially for their actual clock. The AMD-64s can be picked up
>>>for about the same and they too are fast. I haven't really done a
>>>complete benchmark run on the one I own so far, but they look
>>>intermediate between Opteron and everything else, at a much lower
>>>price.
>>>
>>> rgb
>>>
>>>--
>>>Robert G. Brown http://www.phy.duke.edu/~rgb/
>>>Duke University Dept. of Physics, Box 90305
>>>Durham, N.C. 27708-0305
>>>Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu
>>>
>>>
>>
>>
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web : http://www.scalableinformatics.com
>phone: +1 734 786 8423
>fax : +1 734 786 8452
>cell : +1 734 612 4615
>