[Beowulf] Re: vectors vs. loops
diep at xs4all.nl
Wed Apr 27 09:51:56 PDT 2005
The new released supercomputer in Netherlands, which was opened yesterday
and it is at work now is 27.5 tflop and has cost 8 million euro.
that's roughly 27.5 gflop / 8k euro = far over 3 gflop per 1000 euro.
So your estimate is not far off.
Please note this supercomputer doesn't fall under the normal organisations
but falls under the lofar.nl project and RUG (university Groningen) has to
do with it.
In general your story is true that majority of todays software isn't
vectorizable. However the counter side is that the majority of system TIME
goes to just a few libraries. Usually fortran libraries that do matrix
A co processor for intel processors is wishful thinking for now, they have
instead made the processors very cheap to produce and have created pathetic
small L1 caches in their processors with as a result that whole ranges of
applications are pathetic slow at their cheap processors.
Open market competition is what is going to force these manufacturers to
rethink their strategy (hopefully).
Until now intel has run into the opposite direction of what we would like,
making only real expensive supercomputer processors instead of cheap ones :)
Even more interesting than CELL would be a cell type processor that uses
the same ideas but combines it with help processors that run integer
applications real fast.
Of course intel is not going to deliver that as their small L1 caches of P4
suck too much for serious applications that are branch dependant (such as
databases). Also pentium-m's L1 cache, though very good, is too tiny for
future 64 bits usage.
So my personal hope is more directed onto an X-core AMD processor, or some
other manufacturer that produces a great cpu.
I'm a bit worried about RAMBUS being involved into cell i have to admit.
When they were involved in intels P4 it sucked real bad, both in
performance for my applications as well as price; only when Northwood with
DDR ram came, P4 recovered quite some.
Affordable RAM is more interesting than reading online paper claims about
major bandwidth that we cannot afford.
Yet the main point is that a CELL processor serves not only a certain market.
I estimate for my own branchy integer application it to be same speed like
an 8 processor Xeon MP 2.8Ghz.
So even there it can kick some butt, if it has an affordable price, which a
8 processor Xeon MP never will have.
At 09:47 AM 4/27/2005 -0400, Robert G. Brown wrote:
>On Tue, 26 Apr 2005, Ben Mayer wrote:
>> The point of the study was teaching them parallel programming with MPI
>> and UPC to see what the (dis)liked/(mis)understood to gain a better
>> understanding of how we can help them learn. An observation was that
>> they wrote their code in such a way that it was difficult (or
>> impossible) for the compiler to vectorize the code (even with
>This isn't terribly surprising, at least to me. Back in the old days,
>when the North Carolina supercomputer center was still around and had a
>Cray YMP as its centerpiece, a whole lot of the stuff run on it wasn't
>particularly vectorizable. People would apply for a "supercomputer
>account" and expect their ported code to run at 100's of MFLOPS
>(standard at the time on the desktop was maybe 1-4 MFLOPS depending on
>how expensive your Unix desktop or departmental compute server was).
>However, most code doesn't vectorize too well (even, as you say, with
>directives), so people would end up getting 25 MFLOPs out of 300 MFLOPs
>possible -- faster than a desktop, sure, but using a multimillion dollar
>machine to get a factor of MAYBE 10 in speedup compared to (at the time)
>$5-10K machines. In the meantime, I'm sure that there were people who
>had code that DID vectorize well pulling their hair because of all those
>100 hour accounts that basically wasted 90% of the resource.
>My own code doesn't vectorize too well because it isn't heavy on linear
>algebra and the loops are over 3 dimensional lattices where
>nearest-neighbor sums CANNOT be local in a single memory stream and
>where a cluster methodology that makes the code's convergence SCALE
>faster requires that sites be selected for cluster inclusion that are
>effectively random memory accesses (the slowest possible,
>cache-defeating form of memory access). Rob Peter, pay Paul.
>However, it PARALLELIZES just gangbusters fine, since accumulating
>statistics on independent runs is embarrassingly parallel and produces
>nearly perfect linear speedup. That (in the early 90's) was when I
>realized that a cluster of 10-20 $8000 Sun workstations (all but one or
>two paid for by other people or the University) could give me twice the
>performance of a Cray on my code, not on a measley 100 hour "research"
>account but on a 24x7 basis where every week that went by I accumulated
>more than twice as many hours. And the workstations were all in use by
>humans at the same time as desktops! My jobs just ran quietly in the
>background and didn't noticeably slow the sunview or SGI GUI. Finally,
>the code required no special structure or directives -- it was nice,
>portable, serial code.
>The moral of this particular story is to NOT try to force code onto a
>vector environment unless it is, really, a vector task. Indeed, don't
>force code into a PARALLEL environment (e.g. into PVM or MPI) unless it
>is a NON-TRIVIAL parallel task (I spent a lot of time rewriting my code
>as master-slave stuff in PVM, only to finally realize that EP tasks are
>more easily managed by just running the damn jobs independently via e.g.
>a script and accumulating results with other scripts, because writing
>ROBUST PVM (or MPI) code -- code that can survive a casual reboot or
>interruption of any particular node -- is Not Easy.
>If it IS a vector (or nontrivial parallel, or both) task, then the
>problem almost by definition will EITHER require extensive "computer
>science" level study -- work done with Ian Foster's book, Amalsi and
>Gottlieb for parallel and I don't know what for vector as it isn't my
>area of need or expertise and Amazon isn't terribly helpful (most books
>on vector processing deal with obsolete systems or are out of print, it
>However, the observation that in many cases using an appropriate library
>in the vector case is the way to go is dead on the money. One important
>example is linear algebra, which is something that generally DOES
>vectorize well (almost by definition:-). ATLAS clearly demonstrates
>that linear algebra subroutines can vary in speed by a factor of 3 or
>more just by switching algorithms and blocking/stride in a way that is
>optimal for the memory/CPU sub-architectures in use. Ordinary humans
>(that is, intro students in any CPS class) aren't going to see this kind
>of speedup from writing mundane loops that implement standard algorithms
>in straightforward ways. The problem is obviously complicated still
>further by the existence of "vector" co-processors that can do certain
>operations on certain memory blocksizes very quickly, which a whole
>DISTINCT set of memory latencies and bandwidths to contend with,
>complicated still further by SSE instructions and on-CPU specialty
>Clearly there is no "solution" that fits all of the possible cases;
>equally clearly ANY ATTEMPT TO OPTIMIZE for a particular architecture
>instance comes AT THE EXPENSE OF code portability, code readability,
>code instrumentation with numerous #ifdefs and pragmas. Library calls
>for "separable" components that might vectorize at least provide an API
>layer that can leave the code itself portable and require that all
>optimization energy go into the library (or not).
>> >From the experiences of helping instruct two parallel classes, and my
>> own experiences learning (and observing), it is a very rare case that
>> a scientific code (or most any large code) will be be automatically
>> vectorized all of the way through, unless one has a good understanding
>> of computer architecture AND significant experience producing code
>> that the compiler can optimize for a given arch.
>Absolutely, dead on the money. Especially since there is no "COTS"
>vector (co)processor, so that these days most vectorization involves the
>limited amount available within SSE on OTS cpus or one-of-a-kind vector
>units that are neither standard nor code-portable.
>It is an interesting question as to what would happen if e.g. Intel or
>AMD went the co-processor route once again, and developed a dedicated
>function, really really fast vector/float coprocessor for their 64 bit
>lines of CPUs and standardized a machine language interface for the
>architecture so that compilers could "automatically" use it. I'd guess
>that the problems such an approach would face would be profound, as
>memory bandwidth is likely to be a rate limiting obstacle that might
>make the result no faster than memory-bound linear algebra code is
>already on OTS CPUs. As Mark has somewhat wryly observed already. It
>is quite probable that they HAVE thought about doing this and rejected
>it, both because of a relatively limited market for the result and
>because it may NOT speed things up that much compared to what they can
>do with a general purpose CPU with efficient floating point pipelines
>and state-of-the-art memory architecture.
>For me, I just revel in the Computer Age. A decade ago, people were
>predicting all sorts of problems breaking the GHz barrier. Today CPUs
>are routinely clocked at 3+ GHz, reaching for 4 and beyond. A decade
>ago, 10 MFLOPS was still a respectable speed and only available on
>systems that cost order of ten thousand dollars. Today a GFLOP is a
>mundane speed and available on systems that cost order of one thousand
>dollars. A decade ago a "supercomputer" provided a few hundred MFLOPS,
>costs millions of dollars for the hardware, hundreds of thousands of
>dollars in annual maintenance, hundreds of thousands more each year for
>housing, cooling, and a staff of professional acolytes that had to be
>propitiated in order to make anything like full use of the resource
>(with the exception of the relatively few of us who had discovered
>cluster computing by design or accident). Any sort of computer CAPABLE
>of more than the most mundane performance was considered a >>munition<<
>by Our Government and export-restricted lest Bad People design nuclear
>weapons with them.
>Today, supercomputers routinely get 100's of GFLOPS on up, at a cost of
>less than $1000/GFLOP. Anybody can build them. Amazingly, in spite of
>the fact that my aged and obsolete laptop would have been a munition a
>decade ago and that just about every country on the planet has at least
>ONE cluster capable of eating any ten munitions-quality supercomputers a
>decade ago, Bad People apparently are no closer or farther from nuclear
>devices than they ever were (as their primary obstacle is obtaining
>weapons-grade nuclear material, not coming up with or implementing an
>"adequately" functional design as the latter is really pretty trivial in
>terms of modern technology and a bomb doesn't have to be "perfect" to be
>Ahh, it's good to be alive.
>> CVL is a library that has to be called. This means that one has to
>> explicitly make a function call to the lib for your code to have
>> vectorized segments. Function calls greatly diminish a compilers
>> ability to optimize. Cray's compilers (and recently gcc) will
>> automatically vectorize code for you, if it can. Cray's will even make
>> it run in parallel, across processors, to some extent. The
>> optimization can be greatly helped by knowing how to help the
>> compiler. Learning how to guide the compiler takes a good deal of time
>> to learn. Unless Sony/IBM releases an awesome compiler and some great
>> vectorized sub-systems (a graphics render, physics engine, etc) the
>> Cell's (PS3) speed will be hindered greatly by code not vectorizing.
>> This could be a great opportunity to make a large sum of money.
>> NESL does not appear to have momentum. Started in 1991 and the last
>> paper is in 1997. They talk about running on Paragon and CM5 (I
>> believe that UMN had serial #1), not about Cray X1, which is what we
>> were using.
>> scandal's web site makes it look like it is a research language only.
>> MPI is standard, UPC has compilers from many vendors (including Cray).
>> MPI is installed everywhere and runs everywhere, UPC can run on
>> distributed memory and SSI systems, and is currently being explored
>> for strengths and weaknesses.
>> Heck Matlab might be a good choice. Everything is a matrix AND it runs
>> in parallel... now if only our large machine with matlab would be
>> stable we could give some more assignments. ;)
>> On 4/26/05, Andrew Piskorski <atp at piskorski.com> wrote:
>> > On Tue, Apr 26, 2005 at 10:01:25AM -0500, Ben Mayer wrote:
>> > > I actually just did a small study of how well students in a parallel
>> > > computing class write parallel codes on X1 with MPI and UPC. One of
>> > > the things that stood out is that they tended to do odd things in
>> > > their loops that inhibit code from vectorizing.
>> > So, why were these students writing loops in the first place? If the
>> > goal is to generate vectorized code, wouldn't it make more sense to
>> > use a language or library which directly supports vector commands?
>> > E.g., although they're used for serial not parallel programming, S and
>> > R are vector oriented in a pleasantly convenient way. There do exist
>> > languages and libraries specifically intended for vector programming,
>> > like CVL or NESL, right, so, are they not useful?
>> > http://www-2.cs.cmu.edu/~scandal/
>> > --
>> > Andrew Piskorski <atp at piskorski.com>
>> > http://www.piskorski.com/
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit
>Robert G. Brown http://www.phy.duke.edu/~rgb/
>Duke University Dept. of Physics, Box 90305
>Durham, N.C. 27708-0305
>Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
>Beowulf mailing list, Beowulf at beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
More information about the Beowulf