[Beowulf] Woodcrest Memory bandwidth

Mon Aug 14 19:48:19 PDT 2006

Um, I haven't looked closely at Woodcrest lately, but everyone does
remember that on a write, you have to fetch the cache line that you are
writing to, right?  So, if you have a 10 GB/s memory system, the most a
copy should be able to do is:

read the source at 3.3 GB/s
read the destination at 3.3 GB/s
write the destination at 3.3 GB/s

STREAM only counts the source read and the destination write, so what
it will report is 6.6 GB/s, which, um, matches the results earlier in
this thread.
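
A quick sketch of that arithmetic in Python, in case it helps. The
function name and the write_allocate flag are made up for illustration
(the flag just shows what happens when the destination line is not
fetched first), and the even three-way split of peak bandwidth is the
same simplification as above. Note that 2/3 of 10 GB/s prints as 6.7
when rounded; the 6.6 above is 2 x 3.3.

```python
def copy_traffic(peak_gbs, write_allocate=True):
    """Split peak memory bandwidth evenly across the streams a copy generates.

    Returns (bandwidth per stream, bandwidth a STREAM-style copy reports).
    """
    # Streams hitting memory: the source read, the destination write, and,
    # with write-allocate, a read of the destination line before the write.
    streams = 3 if write_allocate else 2
    per_stream = peak_gbs / streams
    # STREAM-style accounting credits only the source read and the
    # destination write, i.e. two of those streams.
    reported = 2 * per_stream
    return per_stream, reported


per_stream, reported = copy_traffic(10.0)
print(f"per stream: {per_stream:.1f} GB/s, reported: {reported:.1f} GB/s")
```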

					Keith

On Mon, 2006-08-14 at 19:00 -0600, Stuart Midgley wrote:
> actually, latency determines bandwidth more than memory speed...  see 
> my rant(?) from a little over a year ago
> 
> http://www.beowulf.org/archive/2005-July/013294.html
> 
> prefetch etc. help, but the limiting factor is still latency.  Hence 
> the opterons have significantly higher real memory bandwidth (their 
> latency is about 1/3 that of xeons/p4's etc).  If you look at the 
> ia64's then they have even higher latency again, but they can have a 
> huge number of outstanding loads (from memory it's >80), so their 
> effective bandwidth is high.
> 
> Stu.
> 
> 
> On 15/08/2006, at 8:10, Joe Landman wrote:
> 
> > Hi Stu:
> >
> > Stu Midgley wrote:
> >> sorry, forgot to reply all... don't you hate gmail's interface 
> >> sometimes?
> >>
> >>
> >> What is the memory latency of the woodcrest machines?  Since memory
> >> latency really determines your memory bandwidth.
> >
> > Hmmm...  not for large block sequential accesses.  You can prefetch
> > these assuming enough intelligence in the code generator (heh), or the
> > hardware if the memory access pattern is fairly consistent.
> >
> > Latency really defines the random access local node GUPS, well, it's
> > really more complex than that, but roughly that.
> >
> > That said, I would like to measure this.  I have an old code which does
> > this, any pointers on code other people would like me to run?  If it's
> > not too hard (e.g. less than 15 minutes) I might do a few.
> >
> >> If Intel hasn't made any improvements in latency then the limited
> >> number of outstanding loads in the x86-64 architecture will limit the
> >> bandwidth regardless of the MB/s you throw at it.
> >
> > Hmmm... Ok, you are implying that if your processor can consume the
> > load/store slots faster than it can launch them, and there are a limited
> > number of memory operations in flight (2? as I remember, not looking at
> > my notes now), it is going to be load-store pipeline limited, not
> > necessarily "bandwidth".  That is, the memory system would be
> > significantly faster than the CPU can consume.
> >
> > I haven't looked closely at the Woodcrest arch yet.  Don't know
> > precisely what they are doing here and how it differs from AMD.  Would
> > be interesting.  So far I haven't been impressed with code that I
> > thought I should be really impressed with on this machine.  Oddly the
> > performance was about what we got out of the Core Duo on this platform.
> >
> 
> 
> --
> Dr Stuart Midgley
> Industry Uptake Program Leader
> iVEC, 'The hub of advanced computing in Western Australia'
> 26 Dick Perry Avenue, Technology Park
> Kensington WA 6151
> Australia
> 
> Phone: +61 8 6436 8545
> Fax: +61 8 6436 8555
> Email: industry at ivec.org
> WWW:  http://www.ivec.org
> 
-- 
Keith D. Underwood                            Scalable Computing Systems
Senior Member of Technical Staff            Sandia National Laboratories
kdunder at sandia.gov