[Beowulf] Re: vectors vs. loops
Josip Loncaric
josip at lanl.gov
Wed May 4 09:07:13 PDT 2005
Eugen Leitl wrote:
> On Wed, May 04, 2005 at 09:19:35AM -0600, Josip Loncaric wrote:
>
>
>>That may work for games, but not for everyone. A common operation like
>>
>>C = A + B
>>
>>is very fast when A, B, and C are small enough to fit into the cache
>>simultaneously. However, for scientific computing, the size of these
>>vectors could be 1 GB each (per CPU!), and the problem is memory
>>bandwidth bound. Today's memory bandwidths cannot support full CPU
>>speed on a problem like this.
>
>
> There are tricks to optimize available memory bandwidth on modern x86
> architectures though, as described in
>
> http://leitl.org/docs/comp/AMD_block_prefetch_paper.pdf
>
> (and far more in http://leitl.org/docs/comp/AMD64softoptguide.pdf ).
Thanks for the links, but prefetching (which I usually recommend)
doesn't fix this problem: 2 GB must be read from RAM and 1 GB
written, for only 128 M double precision floating point operations.
Each addition reads two 8-byte operands and writes one 8-byte result, so
this example needs 24 bytes of memory bandwidth per FLOP, much more than
today's RAM can deliver. If the CPU can issue ADD instructions at 3
GHz, running at full speed would take about 72 GB/s of memory bandwidth.
Unfortunately, today's RAM supplies less than 5% of this requirement.

Real CFD code can do a bit more work per memory access, and benefits
from prefetching, but often runs into the same memory bandwidth
bottleneck as C = A + B. Prefetching can hide latency problems, but not
bandwidth bottlenecks.
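
For anyone who wants to check the arithmetic above, here is a minimal sketch in Python. The function and variable names are made up for illustration. The inputs (1 GB vectors, 8-byte doubles, a 3 GHz ADD issue rate, and the 5% figure) are the ones quoted in this message, not measurements.

```python
# Back-of-the-envelope check of the C = A + B bandwidth estimate.
# All names are illustrative; the inputs are the figures quoted in the text.

DOUBLE_BYTES = 8  # size of one double precision value


def traffic_per_add(elem_bytes=DOUBLE_BYTES):
    """Bytes moved per element of C = A + B: read A[i], read B[i], write C[i]."""
    return 3 * elem_bytes


def required_bandwidth(adds_per_second, elem_bytes=DOUBLE_BYTES):
    """Bytes per second needed to keep the adder busy at the given issue rate."""
    return adds_per_second * traffic_per_add(elem_bytes)


vector_bytes = 2**30                         # 1 GB per vector
n_adds = vector_bytes // DOUBLE_BYTES        # one add per element
total_traffic = n_adds * traffic_per_add()   # 2 GB read + 1 GB written
needed = required_bandwidth(3e9)             # assumed 3 GHz ADD issue rate
ceiling = 0.05 * needed                      # "less than 5%" of the requirement

print(f"adds: {n_adds / 2**20:.0f} M")
print(f"bytes per FLOP: {traffic_per_add()}")
print(f"total traffic: {total_traffic / 2**30:.0f} GB")
print(f"needed: {needed / 1e9:.0f} GB/s")
print(f"5% of needed: {ceiling / 1e9:.1f} GB/s")
```

Under these assumptions, moving 3 GB at no more than 3.6 GB/s takes roughly 0.9 s, while 128 M adds at 3 GHz would take about 0.045 s, so the memory system rather than the adder sets the run time.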
Sincerely,
Josip