[Beowulf] Re: vectors vs. loops
Mikhail Kuzminsky
kus at free.net
Thu Apr 28 07:46:32 PDT 2005
In message from Joe Landman <landman at scalableinformatics.com> (Wed, 27
Apr 2005 15:51:49 -0400):
>Hi Art:
>
> Any particular codes you have in mind? I used to play around with
>lots of DFT (LDA) codes. Back then, large systems were 256 x 256, with
>periodic BCs.
Most (practically all) DFT codes are not limited by the eigenvalue
problem. The limiting stage is the computation of the two-electron
integrals and of the Fock matrix ("fockian").
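
A back-of-the-envelope sketch of why that is (my own illustration, with
arbitrary constants; only the scaling matters): the two-electron
integral / Fock-build step is formally O(N^4) in the number of basis
functions before integral screening, while the diagonalization is only
O(N^3).

# Rough cost model for one SCF/DFT iteration (illustrative only).
# N is the number of basis functions; the constants are arbitrary.
def scf_step_costs(n_basis: int) -> dict:
    # Two-electron integrals / Fock build: formally O(N^4) before screening.
    fock_build = n_basis ** 4
    # Diagonalization of the Fock matrix: O(N^3).
    diagonalization = n_basis ** 3
    return {"fock_build": fock_build, "diagonalization": diagonalization}

if __name__ == "__main__":
    for n in (100, 500, 2000):
        c = scf_step_costs(n)
        ratio = c["fock_build"] / c["diagonalization"]
        print(f"N={n:5d}  Fock/diag cost ratio ~ {ratio:.0f}")
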
Yours
Mikhail Kuzminsky
Zelinsky Institute of Organic Chemistry
Moscow
> We used a number of eigensolvers, and eventually settled on LAPACK's
>zheev. Modeling supercells much larger than 64 atoms with 4 electronic
>basis states was a challenge using that code.
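
(For reference, and not part of the original exchange: the zheev family
Joe mentions is what numpy.linalg.eigh dispatches to for complex
Hermitian input in most builds, so a minimal sketch of that kind of
dense diagonalization looks like the following; the 256x256 size
matches a 64-atom, 4-basis-state supercell.)

# Build a random complex Hermitian "Hamiltonian" and diagonalize it.
# numpy.linalg.eigh calls into LAPACK's Hermitian eigensolvers (the
# zheev/zheevd family for complex input).
import numpy as np

rng = np.random.default_rng(0)
n = 256                                # 64 atoms x 4 basis states

a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
h = (a + a.conj().T) / 2               # Hermitize

eigvals, eigvecs = np.linalg.eigh(h)   # ascending real eigenvalues
print(eigvals[:5])
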
>
> Do you have a particular model system in mind as well? A nice GAMESS
>model (or similar) might work out nicely. I would like to include some
>electronic structure codes in our (evolving) BBS system.
>
>Joe
>
>Art Edwards wrote:
>> This subject is pretty important to us. We run codes where the
>> bottleneck is eigensolving for matrices with a few thousand elements.
>> Parallel eigensolvers are not impressive at this scale. In the dark
>> past, I did a benchmark on a Cray Y-MP using a vector eigensolver and
>> got over 100x speedup. What I don't know is how this would compare to
>> current compilers and CPUs. However, the vector pipes are not very
>> deep on any of the current processors except, possibly, the PPC. So,
>> I would like to see benchmarks of electronic structure codes that are
>> bound by eigensolvers on a "true vector" machine.
>>
>> Art Edwards
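
(A rough timing harness for exactly this kind of comparison -- my own
sketch, not an established benchmark: time a dense complex Hermitian
eigensolve at the few-thousand scale Art describes, on whatever
LAPACK/BLAS the machine under test provides, and compare the wall-clock
numbers across machines.)

import time
import numpy as np

def time_eigh(n: int, repeats: int = 3) -> float:
    """Best-of-N wall-clock time for one dense Hermitian eigensolve."""
    rng = np.random.default_rng(42)
    a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    h = (a + a.conj().T) / 2
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        np.linalg.eigh(h)
        best = min(best, time.perf_counter() - t0)
    return best

if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(f"n={n:5d}  best of 3: {time_eigh(n):7.2f} s")
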
>>
>> On Wed, Apr 27, 2005 at 01:15:42PM -0400, Robert G. Brown wrote:
>>
>>>On Wed, 27 Apr 2005, Ben Mayer wrote:
>>>
>>>
>>>>>However, most code doesn't vectorize too well (even, as you say,
>>>>>with directives), so people would end up getting 25 MFLOPs out of
>>>>>300 MFLOPs possible -- faster than a desktop, sure, but using a
>>>>>multimillion dollar machine to get a factor of MAYBE 10 in speedup
>>>>>compared to (at the time) $5-10K machines.
>>>>
>>>>What the people who run these centers have told me is that a
>>>>supercomputer is worth the cost if you can get a speedup of 30x over
>>>>serial. What do others think of this?
>>>
>>>I personally think that there is no global answer to this question.
>>>There is only cost-benefit analysis. It is trivially simple to reduce
>>>this assertion (by the people who run the centers, who are not exactly
>>>unbiased here:-) to absurdity for many, many cases. In either
>>>direction -- for some it might be worth it for a factor of 2 in
>>>speedup, for others it might NEVER be worth it at ANY speedup.
>>>
>>>For example, nearly all common and commercial software isn't worth it
>>>at any cost. If your word processor ran 30x faster, could you tell?
>>>Would you care? Would it be "worth" the considerable expense of
>>>rewriting it for a supercomputer architecture to get a speedup that
>>>you could never notice (presuming that one could actually speed it
>>>up)?
>>>
>>>Sure it's an obvious exception, but the problem with global answers is
>>>they brook no exceptions even when there are obvious ones. If you
>>>don't like the word processor example, pick a suitably rendered
>>>computer game (zero productive value, but all sorts of speedup
>>>opportunities). Pick any software with no particular VALUE in the
>>>return or with a low OPPORTUNITY COST of the runtime required to run
>>>it.
>>>
>>>A large number of HPC computations are in the latter category. If I
>>>want to run a simple simulation that takes eight hours on a serial
>>>machine and that I plan to run a single time, is it worth it for me to
>>>spend a month recoding it to run in parallel in five minutes?
>>>Obviously not. If you argue that I should include the porting time in
>>>the computation of "speedup", then I'd argue that if I have a program
>>>that takes two years to run without porting and that takes six months
>>>to port into a form that runs on a supercomputer in six months more,
>>>well, a year of MY life is worth it, depending on the actual COST of
>>>the "supercomputer" time compared to the serial computer time. Even
>>>in raw dollars, my salary for the extra year is nontrivial compared to
>>>the cost of purchasing and installing a brand-new cluster just to
>>>speed up the computation by a measly factor of two or four, depending
>>>on how you count.
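
(To make rgb's arithmetic concrete, a toy calculator for that trade-off;
all the dollar figures below are invented, purely for illustration.)

def port_payoff(serial_months: float, port_months: float, speedup: float,
                researcher_month: float, machine_cost: float):
    """Wall-clock months saved by porting, and a rough dollar net: the
    value of the researcher-time saved minus the cost of the machine the
    ported code runs on.  Porting time counts against the savings,
    because the researcher is busy porting instead of doing science."""
    parallel_months = serial_months / speedup
    months_saved = serial_months - (port_months + parallel_months)
    net_dollars = months_saved * researcher_month - machine_cost
    return months_saved, net_dollars

# First scenario above: an 8-hour run, a month of porting, ~100x speedup.
print(port_payoff(serial_months=8 / 720, port_months=1, speedup=100,
                  researcher_month=8000, machine_cost=30000))
# Second scenario: a two-year run, six months to port, 4x speedup.
print(port_payoff(serial_months=24, port_months=6, speedup=4,
                  researcher_month=8000, machine_cost=30000))
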
>>>
>>>So pay no attention to your supercomputer people's pronouncement.
>>>That number (or any other) is pulled out of, uh, their nether regions
>>>and is unjustifiable. Instead, do the cost-benefit analysis, problem
>>>by problem, using the best possible estimates you can come up with for
>>>the actual costs and benefits.
>>>
>>>That very few people EVER actually DO this does not mean that it isn't
>>>the way it should be done;-)
>>>
>>>
>>>>:) I needed to do some CHARMM runs this summer. The X1 did not like
>>>>it much (neither did I, but when the code is making references to
>>>>punch cards and you are trying to run it on a vector super, I think
>>>>most would feel that way). I ended up running it in parallel by a
>>>>method similar to yours. Worked great!
>>>
>>>The easy way into cluster (or nowadays, "grid") computing, for sure.
>>>If your task is or can be run embarrassingly parallel, well, parallel
>>>scaling doesn't generally get much better than a straight line of
>>>slope one, barring the VERY few problems that exhibit superlinear
>>>scaling for some regime....;-)
>>>
>>>
>>>>>If it IS a vector (or nontrivial parallel, or both) task, then the
>>>>>problem almost by definition will EITHER require extensive "computer
>>>>>science" level study -- work done with Ian Foster's book, Almasi and
>>>>>Gottlieb for parallel, and I don't know what for vector, as it isn't
>>>>>my area of need or expertise and Amazon isn't terribly helpful (most
>>>>>books on vector processing deal with obsolete systems or are out of
>>>>>print, it seems).
>>>>
>>>>So what we should really be trying to do is matching code to the
>>>>machine. One of the problems that I have run into is that unless one
>>>>is at a large center there are only one or two machines that provide
>>>>computing power. Where I am from we have an X1 and a T3E. Not a very
>>>>good choice between the two. There should be a cluster coming up
>>>>soon, which will give us the options that we need, i.e. vector or
>>>>cluster.
>>>
>>>No, what you SHOULD be doing is matching YOUR code to the cluster you
>>>design and build just for that code. With any luck, the cluster
>>>design will be a generic and inexpensive one that can be reused
>>>(possibly with minor reconfigurations) for a wide range of parallel
>>>problems. If your problem DOES trivially parallelize, nearly any
>>>grid/cluster of OTS computers capable of holding it in memory on
>>>(even) sneakernet will give you linear speedup.
>>>
>>>Given Cluster World's Really Cheap Cluster as an example, you could
>>>conceivably end up with a cluster design containing nodes that cost
>>>between $250 and $1000 each, including switches and network and
>>>shelving and everything, that can yield linear speedup on your code.
>>>Then you do your cost-benefit analysis, trade off your time, the value
>>>of the computation, the value of owning your own hardware and being
>>>able to run on it 24x7 without competition, the value of being able to
>>>redirect your hardware into other tasks when your main task is idle,
>>>any additional costs (power and AC, maybe some systems administration,
>>>maintenance). This will usually tell you fairly accurately both
>>>whether you should build your own local cluster vs run on a single
>>>desktop workstation vs run on a supercomputer at some center, and will
>>>even tell you how many nodes you can/should buy and in what
>>>configuration to get the greatest net benefit.
>>>
>>>Note that this process is still correct for people who have code that
>>>WON'T run efficiently on really cheap node or network hardware; they
>>>just have to work harder. Either way, the most important work is
>>>prototyping and benchmarking. Know your hardware (possibilities) and
>>>know your application. Match up the two, paying attention to how much
>>>everything costs and using real world numbers everywhere you can.
>>>AVOID vendor provided numbers, and look upon published benchmark
>>>numbers for specific micro or macro benchmarks with deep suspicion
>>>unless you really understand the benchmark and trust the source. For
>>>example, you can trust anything >>I<< tell you, of course...;-)
>>>
>>>
>>>>The manual for the X1 provides some information and examples. Are the
>>>>Apple G{3,4,5} the only processors that have real vector units? I
>>>>have not really looked at SSE(2), but I remember that they were not
>>>>full precision.
>>>
>>>What's a "real vector unit"? On chip? Off chip? Add-on board?
>>>Integrated with the memory and general purpose CPU (and hence
>>>bottlenecked) how?
>>>
>>>Nearly all CPUs have some degree of vectorization and parallelization
>>>available on chip these days; they just tend to hide a lot of it from
>>>you. Compilers work hard to get that benefit out for you in general
>>>purpose code, where you don't need to worry about whether or not the
>>>unit is "real", only about how long it takes the system to do a stream
>>>triad on a vector 10 MB long. Code portability is a "benefit" and
>>>code specialization is a "cost" when you work out the cost-benefit of
>>>making things run on a "real vector unit". I'd worry more about the
>>>times returned by e.g. stream with nothing fancy done to tune it than
>>>about how "real" the underlying vector architecture is.
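
(For anyone who has not run it: the "stream triad" is the
a(i) = b(i) + q*c(i) kernel from McCalpin's STREAM benchmark. The real
benchmark is a C/Fortran program; the numpy stand-in below carries some
interpreter overhead but is memory-bandwidth bound in much the same
way.)

import time
import numpy as np

n = 1_250_000                    # 1.25M doubles, ~10 MB per array
b = np.random.rand(n)
c = np.random.rand(n)
a = np.empty_like(b)
q = 3.0

best = float("inf")
for _ in range(10):
    t0 = time.perf_counter()
    np.add(b, q * c, out=a)      # triad: a[i] = b[i] + q*c[i]
    best = min(best, time.perf_counter() - t0)

# STREAM counts 24 bytes/element (read b, read c, write a); the numpy
# temporary from q*c moves some extra data, so treat this as a rough
# lower bound on the achievable bandwidth.
bytes_moved = 3 * n * 8
print(f"triad best: {best * 1e3:.2f} ms, "
      f"~{bytes_moved / best / 1e9:.1f} GB/s effective")
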
>>>
>>>Also, if your problem DOES trivially parallelize, remember that you
>>>have to compare the costs and benefits of complete solutions, in
>>>place. You really have to benchmark the computation, fully optimized
>>>for the architecture, on each possible architecture (including systems
>>>with "just" SSE but perhaps with 64 bit memory architectures and ATLAS
>>>for linear algebra that end up still being competitive) and then
>>>compare the COST of those systems to see which one ends up being
>>>cheaper. Remember that bleeding edge systems often charge you a
>>>factor of two or more in cost for a stinkin' 20% more performance, so
>>>that you're better off buying two cheap systems rather than one really
>>>expensive one IF your problem will scale linearly with number of
>>>nodes.
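
(The arithmetic behind that last point, as a toy sketch -- node prices
and speeds invented, linear scaling assumed.)

budget = 2000.0
cheap_node = {"cost": 1000.0, "speed": 1.0}   # arbitrary speed units
fast_node  = {"cost": 2000.0, "speed": 1.2}   # 20% faster, twice the price

def throughput(node: dict, budget: float) -> float:
    """Aggregate throughput from spending the budget on one node type,
    assuming the workload scales linearly across nodes."""
    return (budget // node["cost"]) * node["speed"]

print("cheap nodes:       ", throughput(cheap_node, budget))  # -> 2.0
print("bleeding-edge node:", throughput(fast_node, budget))   # -> 1.2
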
>>>
>>>I personally really like the Opteron, and would commend it to people
>>>looking for a very good general purpose floating point engine. I
>>>would mistrust vendor benchmarks that claim extreme speedups on vector
>>>operations for any big code running out of main memory unless the
>>>MEMORY is somehow really special. A Ferrari runs no faster than a Geo
>>>on a crowded city street.
>>>
>>>As always, your best benchmark is your own application, in all its
>>>dirty and possibly inefficiently coded state. The vendor specs may
>>>show 30 GFLOPS (for just the right code running out of L1 cache or out
>>>of on-chip registers), but when you hook that chip up to main memory
>>>with a 40 ns latency and some fixed bandwidth, it may slow right down
>>>to bandwidth-limited rates indistinguishable from those of a much
>>>slower chip.
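
(A quick back-of-the-envelope version of that slowdown, with assumed
numbers: a triad-like kernel needs roughly 24 bytes of memory traffic
per multiply-add pair, so the sustained rate is capped by bandwidth no
matter what the peak is.)

peak_gflops = 30.0        # vendor figure for in-cache/in-register code
bandwidth_gbs = 6.4       # assumed sustained main-memory bandwidth
bytes_per_flop = 24 / 2   # read b, read c, write a per multiply + add
sustained = bandwidth_gbs / bytes_per_flop
print(f"bandwidth-limited rate: {sustained:.2f} GFLOP/s "
      f"(vs {peak_gflops:.0f} GFLOP/s peak)")
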
>>>
>>>
>>>>>For me, I just revel in the Computer Age. A decade ago, people were
>>>>>predicting all sorts of problems breaking the GHz barrier. Today
>>>>>CPUs are routinely clocked at 3+ GHz, reaching for 4 and beyond. A
>>>>>decade
>>>>
>>>>I just picked up a Sempron 3000+, 1.5GB RAM, 120GB HD, CD-ROM, video,
>>>>10/100 + Intel 1000 Pro for $540 shipped. I was amazed.
>>>
>>>The Opterons tend to go for about twice that per CPU, but they are
>>>FAST, especially for their actual clock. The AMD-64s can be picked up
>>>for about the same and they too are fast. I haven't really done a
>>>complete benchmark run on the one I own so far, but they look
>>>intermediate between Opteron and everything else, at a much lower
>>>price.
>>>
>>> rgb
>>>
>>>--
>>>Robert G. Brown http://www.phy.duke.edu/~rgb/
>>>Duke University Dept. of Physics, Box 90305
>>>Durham, N.C. 27708-0305
>>>Phone: 1-919-660-2567 Fax: 919-660-2525 email: rgb at phy.duke.edu
>>>
>>>
>>
>>
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web : http://www.scalableinformatics.com
>phone: +1 734 786 8423
>fax : +1 734 786 8452
>cell : +1 734 612 4615
>