[Beowulf] Opinions of Hyper-threading?
Vincent Diepeveen
diep at xs4all.nl
Thu Feb 28 09:13:40 PST 2008
On Feb 28, 2008, at 4:33 PM, Mark Hahn wrote:
>>>>> The problem with many (cores|threads) is that memory bandwidth
>>>>> wall. A fixed size (B) pipe to memory, with N requesters on
>>>>> that pipe ...
>>>> What wall? Bandwidth is easy, it just costs money, and not much
>>>> at that. Want 50GB/sec[1] buy a $170 video card. Want 100GB/sec... buy a
>>> Heh... if it were that easy, we would spend extra on more bandwidth for
>>> Harpertown and Barcelona ...
>
> I think the point is that chip vendors are not talking about mere
> doubling of number of cores, but (apparently with straight faces),
> things like 1k GP cores/chip.
>
> personally, I think they're in for a surprise - that there isn't a vast
> market for more than 2-4 cores per chip.
Microsoft might give a helping hand there by making their own
software even more 'user friendly', so that it somehow requires
heavier processors :)
>>> limits, and no programming technique is going to get you around that
>>> limit per socket. You need to change your programming technique to go
>>> many-socket. That limit is the bandwidth wall.
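To put a rough number on that wall (my own back-of-the-envelope; the
per-socket bandwidth figure is assumed, plug in your own):

#include <stdio.h>

/* A fixed pipe of B GB/s into the socket, shared by N cores: each core
   gets B/N on average. The 10 GB/s is an assumption for illustration,
   not a measurement of any particular chip. */
int main(void)
{
    double B = 10.0;                      /* GB/s into the socket (assumed) */
    for (int n = 1; n <= 1024; n *= 4)
        printf("%5d cores -> %7.3f GB/s per core\n", n, B / n);
    return 0;
}

At 1024 cores that is under 10 MB/s per core, which says it all.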
>
> IMO, this is the main fallacy behind the current industry harangue.
> the problem is _NOT_ that programmers are dragging their feet, but
> rather some combination of amdahl's law and the low average _inherent_
> parallelism of computation. (I'm _not_ talking about MC or
> graphics rendering here, but today's most common computer uses: web
> and email.)
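To put Mark's Amdahl point in numbers (the serial fraction here is
picked out of thin air, purely to illustrate the ceiling):

#include <stdio.h>

/* Amdahl's law: speedup(n) = 1 / (s + (1 - s)/n), with serial fraction s.
   s = 0.05 is an assumption, not a measurement of any real workload. */
int main(void)
{
    double s = 0.05;
    int cores[] = { 2, 4, 128, 1024 };
    for (int i = 0; i < 4; i++) {
        int n = cores[i];
        printf("%5d cores -> %5.1fx speedup (ceiling %.0fx)\n",
               n, 1.0 / (s + (1.0 - s) / n), 1.0 / s);
    }
    return 0;
}

Even a 5% serial part caps you at 20x, no matter how many of those 1k
cores you throw at it.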
>
> the manycore cart is being put before the horse. worse, no one has really
> shown that manycore (and the presumed ccnuma model) is actually
> scalable to large values on "normal" workloads. (getting good scaling
> for an AM CFD code on 128 cores in an Altix is kind of a different
> proposition than scaling to 128 cores in a single chip.)
>
> as far as I know, all current examples of large ccnuma scaling are
> premised on core:memory ratios of about 4:1 (4 it2 cores per bank
> of dram in an Altix, for instance.) I don't doubt that we can
> improve memory bandwidth (and concurrency) per chip, but it's not
> an area-driven process, so will never keep up.
>
> so: do an exponential and a sublinear trend diverge? yes: meet
> memory wall.
>
> what's missing is a reason to think that basically all workloads
> can be made
> cache-friendly enough to scale to 10e2 or 10e3 cores. I just don't
> see that.
>
Some fields that are over-represented on this mailing list simply
require more RAM rather than CPU.
Only a few fields have embarrassingly parallel software that needs
nothing but CPU power and hardly any RAM; most of those are
encryption- or security-related searches.
There is, however, a growing number of fields where the communication
speed between the processors is very important, not so much the
bandwidth, but rather the latency.
CPUs are so fast nowadays that algorithms can kick in whose branching
factor, which in practice is the time needed to advance the process by
one step or iteration, depends heavily on the communication speed
between the processors, and especially on reusing data stored in the
(huge) RAM of other memory nodes.
In the long run many fields will of course converge to such types of
algorithms; field after field is inventing algorithms like that, which
is a logical consequence of the progress in hardware.
Today's high-end hardware ALLOWS complex algorithms to get invented,
which IMHO is a good thing.
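A crude model of what I mean (all numbers invented, just to show where
the time goes once the branching factor is latency-bound):

#include <stdio.h>

/* Time per iteration/search step when each step needs a few lookups in
   RAM that lives on another node. The latencies below are assumptions
   for illustration, not measurements of any particular interconnect. */
int main(void)
{
    double compute_us       = 1.0;  /* local work per step, microseconds */
    double remote_lookup_us = 2.0;  /* one fetch from a remote node's RAM */
    int    lookups_per_step = 4;    /* e.g. probes into a shared hash table */

    double step_us = compute_us + lookups_per_step * remote_lookup_us;
    printf("step time: %.1f us, %.0f%% of it waiting on latency\n",
           step_us, 100.0 * lookups_per_step * remote_lookup_us / step_us);
    return 0;
}

Halve the remote latency and the whole iteration runs nearly twice as
fast; that is why I say latency, not bandwidth, is what these
algorithms feel.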
Now let's invent something that makes me coffee :)
> it's really a memory-to-core issue: from what I see, the goal
> should be something in the range of 1GB per core. there are
> examples up to 10G/core
1 GB per core is the standard that most supercomputers already had
eight or so years ago.
It's quite interesting to see how RAM size and the latency between
CPUs haven't kept pace with CPU crunching power.
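Worked out against the 1k-core chips the vendors are talking about
(the 1GB/core target is Mark's figure above, the rest is assumed):

#include <stdio.h>

/* DRAM that would have to sit behind a single socket at ~1 GB/core. */
int main(void)
{
    double gb_per_core = 1.0;             /* target ratio from above */
    int cores[] = { 4, 64, 1024 };
    for (int i = 0; i < 3; i++)
        printf("%5d cores -> %6.0f GB of DRAM behind one socket\n",
               cores[i], cores[i] * gb_per_core);
    return 0;
}

A terabyte of DRAM hanging off every socket is not where commodity
boards are heading.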
> and down to 100M/core, but not really beyond that. (except for stream
> processing, which is great stuff but _cries_ for non-general-purpose HW.)
>
>> As data rates get higher, even really good bit error rates on the
>> wire get to be too big. Consider this: a BER of 1E-10 is quite
>> good, but if you're pumping 10Gb/s over the wire, that's an error
>> every second. (A BER of 1E-10 is a typical rate for something
>> like a 100Mbps link...). So, practical systems
>
> I'm no expert, but 1e-10 seems quite high to me. the docs I found
> about 10G requirements all specified 1e-12, and claimed to have
> achieved 1e-15 in realistic, long-range tests...
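The arithmetic behind that "error every second", for whoever wants to
play with the rates:

#include <stdio.h>

/* Mean time between bit errors = 1 / (line rate * BER). */
int main(void)
{
    double rate_bps = 10e9;               /* 10 Gb/s link */
    double bers[]   = { 1e-10, 1e-12, 1e-15 };
    for (int i = 0; i < 3; i++)
        printf("BER %.0e -> one bit error every %.3g seconds\n",
               bers[i], 1.0 / (rate_bps * bers[i]));
    return 0;
}

So even at the 1e-12 that the 10G specs ask for, a saturated link
still eats an error roughly every 100 seconds.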
In the end the conclusion will of course be that we badly need a newer
process technology from ASML to get into production, so we can build
CPUs with even more transistors and push some of these problems into a
1GB L3 cache :)
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>