[Beowulf] Opinions of Hyper-threading?

Thu Feb 28 07:33:07 PST 2008

>>>> The problem with many (cores|threads) is that memory bandwidth  wall.  A 
>>>> fixed size (B) pipe to memory, with N requesters on that  pipe ...
>>> 
>>> What wall?  Bandwidth is easy, it just costs money, and not much at  that. 
>>> Want 50GB/sec[1] buy a $170 video card.  Want 100GB/sec...  buy a
>> 
>> Heh... if it were that easy, we would spend extra on more bandwidth for
>> Harpertown and Barcelona ...

I think the point is that chip vendors are not talking about a mere
doubling of the number of cores, but (apparently with straight faces)
about things like 1k general-purpose cores per chip.

personally, I think they're in for a surprise: there isn't a vast
market for more than 2-4 cores per chip.

>> limits, and no programming technique is going to get you around that
>> limit per socket.  You need to change your programming technique to go
>> many socket.  That limit is the bandwidth wall.

IMO, this is the main fallacy behind the current industry harangue.
the problem is _NOT_ that programmers are dragging their feet,
but rather some combination of Amdahl's law and the low average _inherent_
parallelism of computation.  (I'm _not_ talking about Monte Carlo or graphics
rendering here, but today's most common computer uses: web and email.)
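
A minimal sketch of the Amdahl's law part of this argument, not a
measurement. It assumes the parallel part of a job scales perfectly and
ignores memory bandwidth entirely; the function name and the parallel
fractions below are illustrative assumptions, not figures from this thread.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup under Amdahl's law.

    parallel_fraction: share of the runtime that can use all cores (0..1).
    cores: number of cores applied to the parallel share.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical parallel fractions; real workloads must be measured.
    for p in (0.50, 0.90, 0.99):
        for n in (2, 4, 128, 1024):
            print(f"parallel={p:.2f} cores={n:5d} "
                  f"speedup={amdahl_speedup(p, n):8.2f}")
```

Under these assumptions a job that is 90% parallel can never exceed a 10x
speedup, however many cores the chip has.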

the manycore cart is being put before the horse.  worse, no one has really
shown that manycore (and the presumed ccnuma model) is actually scalable 
to large values on "normal" workloads.  (getting good scaling for an AM
CFD code on 128 cores in an Altix is kind of a different proposition than
scaling to 128 cores in a single chip.)

as far as I know, all current examples of large ccnuma scaling are
premised on core:memory ratios of about 4:1 (4 Itanium 2 cores per bank of
dram in an Altix, for instance.)  I don't doubt that we can improve
memory bandwidth (and concurrency) per chip, but that's not an area-driven
process the way core count is, so it will never keep up.

so: do an exponential and a sublinear trend diverge?  yes: meet memory wall.
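
A rough sketch of that divergence, assuming core count doubles each chip
generation while per-socket bandwidth grows by a smaller factor. Every
number here (starting cores, starting GB/s, both growth factors) is a
made-up assumption chosen only to show the shape of the curve.

```python
def per_core_bandwidth(generations, cores0=4, bw0_gbs=10.0,
                       core_growth=2.0, bw_growth=1.3):
    """Return (generation, cores, socket GB/s, GB/s per core) tuples.

    All defaults are hypothetical: cores grow by core_growth per
    generation, socket bandwidth by the smaller factor bw_growth.
    """
    rows = []
    for g in range(generations + 1):
        cores = cores0 * core_growth ** g
        bandwidth = bw0_gbs * bw_growth ** g
        rows.append((g, cores, bandwidth, bandwidth / cores))
    return rows


if __name__ == "__main__":
    for g, cores, bw, share in per_core_bandwidth(8):
        print(f"gen {g}: {cores:6.0f} cores, {bw:6.1f} GB/s socket, "
              f"{share:6.3f} GB/s per core")
```

The per-core share shrinks every generation as long as cores grow faster
than bandwidth; the specific rates only change how quickly.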

what's missing is a reason to think that basically all workloads can be made
cache-friendly enough to scale to 1e2 or 1e3 cores.  I just don't see that.

it's really a memory-to-core issue: from what I see, the goal should be
something in the range of 1GB per core.  there are examples up to 10GB/core
and down to 100MB/core, but not really beyond that.  (except for stream
processing, which is great stuff but _cries_ for non-general-purpose HW.)

> As data rates get higher, even really good bit error rates on the wire get to 
> be too big.  Consider this.. a BER of 1E-10 is quite good, but if you're 
> pumping 10Gb/s over the wire, that's an error every second.  (A BER of 1E-10 
> is a typical rate for something like 100Mbps link...).  So, practical systems

I'm no expert, but 1e-10 seems quite high to me.  the docs I found about 10G
requirements all specified 1e-12, and claimed to have achieved 1e-15 in
realistic, long-range tests...
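
The arithmetic behind these error rates can be checked with a few lines.
This is a sketch that assumes independent bit errors and a fully loaded
link; the function name is made up for illustration.

```python
def seconds_between_errors(bit_error_rate, bits_per_second):
    """Mean time between bit errors, assuming independent errors
    and a link that is busy all the time."""
    return 1.0 / (bit_error_rate * bits_per_second)


if __name__ == "__main__":
    ten_gbit = 10e9  # 10 Gb/s in bits per second
    for ber in (1e-10, 1e-12, 1e-15):
        t = seconds_between_errors(ber, ten_gbit)
        print(f"BER {ber:.0e} at 10 Gb/s: one error every {t:,.0f} s")
```

At 10Gb/s a BER of 1e-10 works out to about one error per second, 1e-12 to
about one every 100 seconds, and 1e-15 to about one every 100,000 seconds
(a bit over a day).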