[Beowulf] Re: vectors vs. loops

Vincent Diepeveen diep at xs4all.nl
Tue May 3 17:40:17 PDT 2005


At 06:03 PM 5/3/2005 +0200, Philippe Blaise wrote:
>Robert G. Brown wrote:
>
>>....
>>
>>Still, the marketplace speaks for itself.  It doesn't argue, and isn't
>>legendary, it just is.
>>....
>>  
>>
>
>
>But does the HPC marketplace have a direction?
>
>A few years ago, some people had a "fantastic vision" to replace the
>vector machine market:
>use big clusters of SMPs with the help of the new paradigm of hybrid
>MPI/OpenMP programming.
>Then the main (US) vendors, except Cray, were very happy to sell giant
>clusters of SMP machines.
>
>Nevertheless, the Japanese built the Earth Simulator, which is
>still the most powerful machine in the world
>(don't trust this stupid Top500 list).
>
>Then Cray came back ... with vector machines...
>
>Don't underestimate the power of vector machines.
>Yes, Fujitsu or NEC vector machines are still very efficient, even with
>non-contiguous memory access (!!).
>
>A year ago, the only CPUs that were sometimes able to equal vector
>CPUs were the Alpha (EV7) and the Itanium 2, with
>big caches and/or fast memory access. Remember that Alpha is dead.
>Have a look at the Itanium 2 market share.
>
>The marketplace is not a good argument at all.
>
>Vectorization and parallelization are compatible.
>Hybrid MPI/OpenMP programming is a harder task than MPI/vector programming.
>If you have enough money and if your program is vectorizable, buy a 
>vector machine of course.
>
>Clusters of SMPs? They will remain an efficient and low-cost solution
>(and quite easy for a mass vendor to sell).
>And thanks to clusters of SMPs running Linux, the HPC market is
>now "democratic".
>
>Of course, it would be nice to have a true vector unit on a P4 or Opteron.
>But then the problem would be memory access again.

The P4 indeed has a lot of weak links in its caches, but there are weaker ones.

If we are talking about the Opteron, let's assume we want somewhat faster
multiplication of a few big integers: big, but not so big that FFT-based
multiplication pays off, so the other methods are the interesting ones.

You want to multiply 64 bits x 64 bits, producing a 128-bit result.

The Opteron can do that every other cycle, so it has a 2-cycle latency, so to
speak.

Much nicer would be one multiplication every cycle, or even two.
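
To make concrete what kind of inner loop this is about, here is a rough
sketch of a plain schoolbook big-integer multiply (just my own illustration,
not code from Diep; it assumes GCC's unsigned __int128 extension):

#include <stdint.h>

/* 64x64 -> 128 bit multiply; on the Opteron this maps to one MUL.
   Illustration only, assumes GCC's unsigned __int128 extension. */
static inline void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* Schoolbook multiply of two n-limb integers into a 2n-limb result.
   Every limb pair costs one 64x64 multiply, so for operands that fit
   in cache the multiply throughput, not memory, sets the speed. */
void bigmul(const uint64_t *a, const uint64_t *b, uint64_t *r, int n)
{
    for (int i = 0; i < 2 * n; i++)
        r[i] = 0;
    for (int i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < n; j++) {
            uint64_t hi, lo;
            mul64(a[i], b[j], &hi, &lo);
            unsigned __int128 t = (unsigned __int128)r[i + j] + lo + carry;
            r[i + j] = (uint64_t)t;
            carry = hi + (uint64_t)(t >> 64);
        }
        r[i + n] = carry;
    }
}

The loop does n*n multiplies against only 2*n limbs of input, so it lives in
registers and L1; the only thing that would make it run twice as fast is a
MUL every cycle.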

So basically such multiplication work can be sped up drastically by
improving the chip rather than the memory bandwidth.

A 2-fold speed improvement would not be so bad.

Oh, and about the L1 cache: it has no problem keeping up. It can do 2 reads
from L1 simultaneously.

In fact, all those complaints about memory bandwidth are, IMHO, a sign of not
understanding how the hardware works.

Please have a good look at a benchmark of the dual-core Opteron. Yes, the 2
cores now SHARE the same memory controller:

http://www.sudhian.com/showdocs.cfm?aid=667&pid=2543

So you see that Diep scales by a factor of 3.92 on a dual-socket, dual-core Opteron.

That basically means that memory, despite the program using a 400 MB memory
footprint, is hardly a problem.

Still not convinced?

Why not multiply a 200 MB matrix by another 200 MB matrix?

How much memory bandwidth (bandwidth beyond the L1/L2 caches) does that
generate when programmed very efficiently, versus how many CALCULATIONS the
hardware needs to do?

If you put the two side by side, you will soon see where the real problem is.

You want more multiply-adds per second, not so much more memory bandwidth,
assuming a decent L3 cache size.
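
A back-of-envelope sketch (my own numbers, just to show the ratio; a 200 MB
matrix of doubles is roughly 5000 x 5000):

#include <stdio.h>

int main(void)
{
    double n     = 5000.0;             /* 5000 * 5000 * 8 bytes = 200 MB of doubles   */
    double flops = 2.0 * n * n * n;    /* multiply-adds in a dense matrix product     */
    double bytes = 3.0 * n * n * 8.0;  /* compulsory traffic: read A, read B, write C */

    printf("operations         : %.3g\n", flops);         /* ~2.5e11 */
    printf("bytes to/from RAM  : %.3g\n", bytes);          /* ~6e8, with good blocking */
    printf("operations per byte: %.0f\n", flops / bytes);  /* ~400 */
    return 0;
}

That is roughly 400 calculations for every byte that actually has to cross
the memory bus: the multiply-add units run out long before the memory
controller does.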

Vincent

>Bye,
>
>  Phil.
>