[Beowulf] Opinions of Hyper-threading?
Jim Lux
james.p.lux at jpl.nasa.gov
Thu Feb 28 06:50:27 PST 2008
Quoting Joe Landman <landman at scalableinformatics.com>, on Thu 28 Feb
2008 05:20:01 AM PST:
> Bill Broadley wrote:
>>> The problem with many (cores|threads) is that memory bandwidth
>>> wall. A fixed size (B) pipe to memory, with N requesters on that
>>> pipe ...
>>
>> What wall? Bandwidth is easy, it just costs money, and not much at
>> that. Want 50GB/sec[1] buy a $170 video card. Want 100GB/sec...
>> buy a
>
> Heh... if it were that easy, we would spend extra on more bandwidth for
> Harpertown and Barcelona ...
>
> The point is that the design determines your hard/fixed per socket
> limits, and no programming technique is going to get you around that
> limit per socket. You need to change your programming technique to go
> many socket. That limit is the bandwidth wall.
>
And this is much the same as the earlier discussions on this list,
back when folks were building 8- and 16-processor clusters. There, the
bandwidth wall was the 10 Mbps Ethernet interconnect, first through a
hub, then a switch, and so on.
This is basically why any programming technique for speedup that relies
on tight coupling (e.g. shared memory) can't scale infinitely. At
some point, the speed of light and physical size conspire to do you in.
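To put rough numbers on that (purely illustrative, no particular
machine in mind), a few lines of Python show the latency floor that
physics imposes on a round trip, before you count any switching or
protocol overhead:

# Back-of-the-envelope: speed-of-light floor on a request/response
# across a box or a machine room. Distances are illustrative.
C = 3.0e8              # speed of light in vacuum, m/s
SIGNAL_FRACTION = 0.7  # signals in copper/fiber run at roughly 0.6-0.8 c

for meters in (0.3, 3.0, 30.0):
    one_way_ns = meters / (C * SIGNAL_FRACTION) * 1e9
    round_trip_ns = 2 * one_way_ns
    print(f"{meters:5.1f} m: >= {round_trip_ns:6.1f} ns round trip, "
          f"~{round_trip_ns * 3:5.0f} cycles at 3 GHz")

At 30 m you're already well past a hundred nanoseconds each way, which
is why "shared memory across the machine room" never really works.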
If one wanted to design revolutionary distributed/parallel computing
algorithms, one could probably work with floppy disks and sneakernet.
If it works there, it will certainly work on any faster mechanism.
See, true computer science doesn't need a 1000-processor cluster.
Another cluster-related computer science issue is starting to deal
with unreliable links between the nodes of the cluster. The
overwhelming majority of cluster codes assume that message passing is
perfect and has no errors. Sometimes this is provided transparently
by the communications mechanism (e.g. TCP promises in-order,
error-free delivery). However, in the TCP case that comes at a cost:
the latency isn't constant (because it achieves reliability through
temporal redundancy, i.e. retries), and if your algorithm does some
sort of scatter/gather and needs barrier synchronization, a late
packet on one link brings the whole mass to a halt.
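As a toy model of that effect (pure Python, not any real MPI code; the
node count, latencies, and retransmit probability below are made up
for illustration), watch how a single TCP-style retransmission timeout
on one link holds up the whole gather:

# Toy model: each of N links normally delivers in BASE_US microseconds,
# but with probability P_RETRY a retransmission adds RTO_US. A
# barrier/gather finishes only when the slowest link does, so one late
# packet stalls everybody.
import random

N_LINKS = 256        # hypothetical node count
BASE_US = 50.0       # nominal per-message latency (illustrative)
P_RETRY = 1e-3       # chance a message needs a retransmit (illustrative)
RTO_US = 200_000.0   # a 200 ms retransmission timeout, in microseconds

def gather_time():
    return max(BASE_US + (RTO_US if random.random() < P_RETRY else 0.0)
               for _ in range(N_LINKS))

trials = [gather_time() for _ in range(10_000)]
stalled = sum(t > 1_000 for t in trials) / len(trials)
print(f"median gather: {sorted(trials)[len(trials) // 2]:.0f} us, "
      f"fraction stalled by a retransmit: {stalled:.1%}")

With those made-up numbers, roughly a fifth of the gathers end up
waiting on a 200 ms timeout, even though every individual link looks
fine 99.9% of the time.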
As data rates get higher, even really good bit error rates on the wire
start to hurt. Consider this: a BER of 1E-10 is quite good, but if
you're pumping 10 Gb/s over the wire, that's an error every second.
(A BER of 1E-10 is a typical rate for something like a 100 Mbps link.)
So practical systems use some sort of FEC, but even with that, BERs
of 1E-14 or 1E-15 are pretty much state of the art over shortish
(meters) distances. (It's a power/signal-to-noise-ratio thing: how
much energy can you put into sending one bit of information?)
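A quick sketch of that arithmetic (the rates and BERs are just points
picked for illustration):

# Mean time between bit errors for a given line rate and raw BER.
# The 1E-10 at 10 Gb/s case reproduces the "one error per second" above.
rates_bps = {"100 Mb/s": 100e6, "10 Gb/s": 10e9}

for name, rate in rates_bps.items():
    for ber in (1e-10, 1e-12, 1e-15):
        seconds_per_error = 1.0 / (rate * ber)
        print(f"{name} at BER {ber:.0e}: one error every "
              f"{seconds_per_error:,.0f} s")

Even at 1E-15, a 10 Gb/s link takes a bit error roughly every 28
hours, which is why the link and message layers still need checksums
and some way to recover.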
Jim