[Beowulf] Woodcrest Memory bandwidth

Mon Aug 14 18:46:21 PDT 2006

Heh...

Stuart Midgley wrote:
> actually, latency determines bandwidth more than memory speed...  see my
> rant(?) from a little over a year ago
> 
> http://www.beowulf.org/archive/2005-July/013294.html

Someday the English language will be a complete impediment to
understanding - Matt Austern

We are in fairly violent agreement, but we're talking about two
different things.

I usually phrase what you write as

	T(block) = L + Size/B

where Size is the size of the data object you want to transfer, L is the
latency, and B is the bandwidth available to move it.

This is a (highly) oversimplified model, but it works quite well.  In
reality you have multiple latencies and multiple bandwidths to contend
with.  If you assume that one overall latency (main memory) and one
overall bandwidth (main memory) largely dominate the data motion time in
the machine, then yes, this makes perfect sense.
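
To put rough numbers on the model, here is a minimal Python sketch.  The
latency and bandwidth values are invented for illustration only; they
are not measurements of any machine, and the function name is mine:

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to move one block under the simple model T = L + Size/B."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Hypothetical numbers, chosen only to show the shape of the curve.
LATENCY = 100e-9    # assumed 100 ns main memory latency
BANDWIDTH = 10e9    # assumed 10 GB/s main memory bandwidth

# One cache line, one page, one megabyte.
for size in (64, 4096, 1 << 20):
    t = transfer_time(size, LATENCY, BANDWIDTH)
    latency_share = LATENCY / t
    print(f"{size:>8} bytes: T = {t * 1e9:10.1f} ns, "
          f"latency share = {latency_share:.0%}")
```

For small blocks the L term dominates T; for large blocks the Size/B
term does.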

Your example in the rant is a very good one, as it demonstrates that if
you take a cache line load to get access to your data, you have to pay
the latency cost of that cache line load.  It gets somewhat more complex
if you have a TLB load, or L1<->L2 or ...

If on the other hand you have a well defined memory access pattern,
where you can "pre-insert" software prefetches, or have a hardware
prefetcher, you can largely hide that latency.  Not completely, but you
can overlap memory access with computation.  The problem comes if you
fill up all the memory access slots.  If you do this, the prefetch may
actually hurt you.  Moreover, in a very tight cache situation, you might
accidentally evict a needed cache line with a prefetch.
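
The effect of a limited number of memory access slots can be sketched
with Little's law: with at most n requests in flight, you cannot sustain
more than n * line_size / L.  This is my own back-of-envelope model, not
anything measured; the slot counts, line size, latency and peak
bandwidth below are all assumed values:

```python
def concurrency_limited_bandwidth(outstanding, line_bytes, latency_s):
    """Little's law bound: bytes in flight divided by the latency."""
    return outstanding * line_bytes / latency_s

def sustained_bandwidth(outstanding, line_bytes, latency_s,
                        peak_bytes_per_s):
    """Bandwidth you can sustain: the lower of the slot bound and the
    hardware peak."""
    bound = concurrency_limited_bandwidth(outstanding, line_bytes,
                                          latency_s)
    return min(peak_bytes_per_s, bound)

# Hypothetical numbers, for illustration only.
LINE = 64           # assumed cache line size in bytes
LATENCY = 100e-9    # assumed 100 ns latency
PEAK = 10e9         # assumed 10 GB/s peak bandwidth

for slots in (1, 2, 8, 16, 80):
    bw = sustained_bandwidth(slots, LINE, LATENCY, PEAK)
    print(f"{slots:>3} outstanding loads: {bw / 1e9:5.2f} GB/s")
```

With few slots the latency sets the ceiling; once the slot bound exceeds
the peak, extra prefetches buy nothing more.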

Last year when I taught a course on this stuff at my alma mater, I had a
few example codes that showed the need for reasonable memory access
patterns for data.  All it takes is accessing a large array improperly.
You quickly get a feel for the impact latency will have on overall data
motion performance.  This isn't bandwidth per se; it is more the time
required to move a block of data, for which bandwidth is just one of the
components.
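
A rough way to see the cost of a bad access pattern is to count cache
line fetches.  The sketch below is not one of the course codes; it is a
toy model that charges L + line/B for every line fetched, with no
prefetch or overlap and with aligned data, and every number in it is
assumed:

```python
def sweep_time(n_elems, elem_bytes, stride_elems, line_bytes,
               latency_s, bandwidth_bytes_per_s):
    """Time to touch n_elems elements at a given stride, paying
    L + line/B for each cache line fetched (no overlap assumed)."""
    # How many of the touched elements share one cache line.
    elems_per_line = max(1, line_bytes // (elem_bytes * stride_elems))
    lines_fetched = -(-n_elems // elems_per_line)  # ceiling division
    per_line = latency_s + line_bytes / bandwidth_bytes_per_s
    return lines_fetched * per_line

# Hypothetical numbers, for illustration only.
N = 1 << 20         # elements touched
ELEM = 8            # bytes per double
LINE = 64           # assumed cache line size in bytes
LATENCY = 100e-9    # assumed 100 ns latency
BANDWIDTH = 10e9    # assumed 10 GB/s bandwidth

for stride in (1, 2, 8):
    t = sweep_time(N, ELEM, stride, LINE, LATENCY, BANDWIDTH)
    print(f"stride {stride}: {t * 1e3:8.2f} ms")
```

In this model, touching one double per line costs eight times as much
as walking the array in order, for the same number of elements.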

Happily (or sadly) the same model works for network resources, for disk
access, ...

> 
> prefetch etc. help, but the limiting factor is still latency.  Hence the
> opterons have significantly higher real memory bandwidth (their latency
> is about 1/3 that of xeons/p4's etc).  If you look at the ia64's then
> they have even higher latency again, but they can have a huge number of
> outstanding loads (from memory it's >80), so their effective bandwidth is
> high.

It was either Hennessy or Mashey who said "You can always buy bandwidth,
but latency is forever".  Large latencies will have huge impacts on
certain workloads, especially those that move the data around.  If you
want to define an effective bandwidth

	B* =  Size/T = Size/(L + Size/B)

that would make perfect sense to me.  B is a property of the hardware,
as is L.  B* would be what you observe in moving data in your workload,
and would incorporate your memory access pattern.  Not all memory access
patterns are as nice as streams.
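
As a small sketch of B*, again with made-up hardware numbers rather than
measurements, and with function names of my own choosing:

```python
def effective_bandwidth(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Observed bandwidth B* = Size / (L + Size/B)."""
    return size_bytes / (latency_s + size_bytes / bandwidth_bytes_per_s)

def half_bandwidth_size(latency_s, bandwidth_bytes_per_s):
    """Block size at which B* = B/2.  Setting Size/(L + Size/B) = B/2
    and solving for Size gives Size = L * B."""
    return latency_s * bandwidth_bytes_per_s

# Hypothetical numbers, for illustration only.
LATENCY = 100e-9    # assumed 100 ns latency
BANDWIDTH = 10e9    # assumed 10 GB/s bandwidth

for size in (64, 1024, 65536, 1 << 24):
    b_star = effective_bandwidth(size, LATENCY, BANDWIDTH)
    print(f"{size:>9} bytes: B* = {b_star / 1e9:6.3f} GB/s "
          f"({b_star / BANDWIDTH:.1%} of B)")

print("half-bandwidth block size:",
      half_bandwidth_size(LATENCY, BANDWIDTH), "bytes")
```

B* only approaches B when the blocks are much larger than L * B; for
cache-line-sized random accesses it stays far below it.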

My understanding of GUPS for random memory access is that it is
something similar to B*.  I could be wrong, and would defer to people
who have looked into it.

Joe

> 
> Stu.
> 
> 
> On 15/08/2006, at 8:10, Joe Landman wrote:
> 
>> Hi Stu:
>>
>> Stu Midgley wrote:
>>> sorry, forgot to reply all... don't you hate gmail's interface
>>> sometimes?
>>>
>>>
>>> What is the memory latency of the woodcrest machines?  Since memory
>>> latency really determines your memory bandwidth.
>>
>> Hmmm...  not for large block sequential accesses.  You can prefetch
>> these assuming enough intelligence in the code generator (heh), or the
>> hardware if the memory access pattern is fairly consistent.
>>
>> Latency really defines the random access local node GUPS, well, it's
>> really more complex than that, but roughly that.
>>
>> That said, I would like to measure this.  I have an old code which does
>> this, any pointers on code other people would like me to run?  If it's
>> not too hard (e.g. less than 15 minutes) I might do a few.
>>
>>> If Intel hasn't made any improvements in latency then the limited
>>> number of out-standing loads in the x86-64 architecture will limit the
>>> bandwidth regardless of the MB/s you throw at it.
>>
>> Hmmm... Ok, you are implying that if your processor can consume the
>> load/store slots faster than it can launch them, and there are a limited
>> number of memory operations in flight (2? as I remember, not looking at
>> my notes now), it is going to be load-store pipeline limited, not
>> necessarily "bandwidth".  That is, the memory system would be
>> significantly faster than the CPU can consume.
>>
>> I haven't looked closely at the Woodcrest arch yet.  Don't know
>> precisely what they are doing here and how it differs from AMD.  Would
>> be interesting.  So far I haven't been impressed with code that I
>> thought I should be really impressed with on this machine.  Oddly the
>> performance was about what we got out of the Core Duo on this platform.
>>
> 
> 
> -- 
> Dr Stuart Midgley
> Industry Uptake Program Leader
> iVEC, 'The hub of advanced computing in Western Australia'
> 26 Dick Perry Avenue, Technology Park
> Kensington WA 6151
> Australia
> 
> Phone: +61 8 6436 8545
> Fax: +61 8 6436 8555
> Email: industry at ivec.org
> WWW:  http://www.ivec.org
> 
> 
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615