[Beowulf] Opinions of Hyper-threading?

Thu Feb 28 08:17:04 PST 2008

Quoting Mark Hahn <hahn at mcmaster.ca>, on Thu 28 Feb 2008 07:33:07 AM PST:

>>>>> The problem with many (cores|threads) is that memory bandwidth
>>>>> wall.  A fixed size (B) pipe to memory, with N requesters on
>>>>> that pipe ...
>>>>
>>>> What wall?  Bandwidth is easy, it just costs money, and not much
>>>> at that. Want 50GB/sec[1] buy a $170 video card.  Want
>>>> 100GB/sec...  buy a
>>>
>>> Heh... if it were that easy, we would spend extra on more bandwidth for
>>> Harpertown and Barcelona ...
>
> I think the point is that chip vendors are not talking about mere
> doubling of number of cores, but (apparently with straight faces),
> things like 1k GP cores/chip.
>
> personally, I think they're in for a surprise - that there isn't a vast
> market for more than 2-4 cores per chip.

Perhaps not today.  But then, Thomas Watson said there wasn't a vast
market for computers.. perhaps 5 worldwide.

No question that folks will have to figure out how to effectively use  
all that parallelism.  (e.g. each processor deals with one page of a  
Word document, or a range of Excel cells?).  I can see a lot of fairly  
easily coded things dealing with rapid search (e.g. which of my  
documents have the word hyperthreading and Hahn in them).  Right now,  
search and retrieval of unstructured data is a very computationally  
intensive task that millions of folks suffer through daily. (How many
of you find Google over the web faster than Microsoft's "Search for
File or Folder.." (or grepping the entire disk) on your local machine?)
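As a rough illustration of the "which of my documents have both words" search described above, here is a minimal sketch that hands one document to each worker. The names `contains_all`, `search`, and the sample `docs` are made up for illustration. Note the assumption: it uses a thread pool only to keep the example short; in standard CPython, threads do not give real CPU parallelism for this kind of string work, so a process pool (or a language with native threads) would be needed to actually keep many cores busy.

```python
from concurrent.futures import ThreadPoolExecutor


def contains_all(text, words):
    """True if every word occurs somewhere in text (case-insensitive)."""
    lowered = text.lower()
    return all(w.lower() in lowered for w in words)


def search(docs, words, workers=4):
    """Return the names of documents containing every word.

    One task per document: the work units are independent, which is
    what makes this kind of search easy to spread over many cores.
    """
    names = list(docs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda name: contains_all(docs[name], words), names)
        return [name for name, hit in zip(names, hits) if hit]


# Hypothetical in-memory "documents"; a real tool would read files.
docs = {
    "a.txt": "Hahn wrote about hyperthreading on the Beowulf list.",
    "b.txt": "Memory bandwidth is the wall.",
    "c.txt": "Hyperthreading helps some codes, says Hahn.",
}
matches = search(docs, ["hyperthreading", "Hahn"])
print(matches)
```

Each document is scanned independently, so the only shared step is collecting the results at the end.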

And we cluster dweebs have a headstart on them... we've been dealing  
with figuring out how to spread problems that are too big to fit on  
one node across multiples for years now.  After all, billg's  
programming fame is from a flood fill graphics algorithm, and look how  
well he's done with that <grin>.

>
>>> limits, and no programming technique is going to get you around that
>>> limit per socket.  You need to change your programming technique to go
>>> many socket.  That limit is the bandwidth wall.
>
> IMO, this is the main fallacy behind the current industry harangue.
> the problem is _NOT_ that programmers are dragging their feet, but
> rather some combination of amdahl's law and the low average _inherent_
> parallelism of computation.  (I'm _not_ talking about MC or graphics
> rendering here, but today's most common computer uses: web and email.)

Text search and retrieval is where it's at.  Almost 30 years ago I
worked on developing a piece of office equipment the size of a 2-drawer
file cabinet that would do just that, hooked up to a bunch of
word processors (e.g. find me that letter we sent to John Smith).  It
was expensive!  It had an 80MB (or 160MB) disk drive (huge!), and it could
search thousands of pages in the blink of an eye (called the
OFISfile, sold by Burroughs).  And people DID buy it.  And, without
giving away the internals, it could have made excellent use of a
1000-core type processor.

Granted, the googles of the world will (correctly) contend that an  
equally good solution is to have a good comm link to a centralized  
search and retrieval engine (doesn't even have to be that fast.. just  
comparable to the time it takes me to enter the request and read the  
results).  But, they too can use parallelism.

>
> the manycore cart is being put before the horse.  worse, no one has really
> shown that manycore (and the presumed ccnuma model) is actually
> scalable to large values on "normal" workloads.  (getting good scaling
> for an AM CFD code on 128 cores in an Altix is kind of a different
> proposition than scaling to 128 cores in a single chip.)

To a certain extent it's an example of "build it and they will come"
(to 10% of the things that are built; the other 90% are interesting
blips left by the side of the road).

When compilers were introduced, I'm sure the skilled machine language  
coders said.. hmmph, we can do just fine with our octal and hex,  
there's no expressed demand for high level languages. (Kids..get offa  
my lawn!)  Heck, the plugboard programmers on EAM equipment probably  
said that to the guys working with stored program computers. And  
before that, the supervisor of the computer pool probably said that
to the plugboard guys, as he gazed over a room full of Marchand  
calculators with computers punching numbers and pulling the handles.

>
> what's missing is a reason to think that basically all workloads can be made
> cache-friendly enough to scale to 10^2 or 10^3 cores.  I just don't see that.

Not all workloads... just enough so that it forms a significant
market.  And text search and retrieval is a pretty big consumer of CPU
cycles, in the big wide world (as opposed to the specialized world of
large numeric simulations and the like that have historically been
hosted on clusters).

Remember, the recurring cost is basically related to the size of the  
die, not what's on it.  So, if there's a significant market for 10,000  
processor widgets, they'll be made, and cheaply.

>> As data rates get higher, even really good bit error rates on the
>> wire get to be too big.  Consider this.. a BER of 1E-10 is quite
>> good, but if you're pumping 10Gb/s over the wire, that's an error
>> every second.  (A BER of 1E-10 is a typical rate for something like
>> a 100Mbps link...). So, practical systems
>
> I'm no expert, but 1e-10 seems quite high to me.  the docs I found about 10G
> requirements all specified 1e-12, and claimed to have achieved 1e-15 in
> realistic, long-range tests...

That's probably the error rate above the PHY layer, i.e. after the
forward error correction.  And the 10G requirement is tighter than the
100Mbps requirement, just to make FEC possible with reasonable
redundancy.  Typically, you want a raw PHY BER at least 100x below the
reciprocal of the data rate (e.g. 1E8 bps->1E-10 BER, 1E10 bps->1E-12 BER).
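
The arithmetic behind those numbers can be checked with a few lines. This is only a sketch of the rule of thumb as stated in this thread; the function names and the `margin` parameter are made up for illustration, and it assumes independent bit errors at a constant rate.

```python
def seconds_per_error(bit_rate_bps, ber):
    """Average time between bit errors at a given line rate and BER."""
    return 1.0 / (bit_rate_bps * ber)


def target_ber(bit_rate_bps, margin=100.0):
    """Rule-of-thumb raw BER: `margin` times below 1/bit_rate.

    margin=100 is the "100x" figure from the text, not a standard.
    """
    return 1.0 / (margin * bit_rate_bps)


# 10 Gb/s at a BER of 1E-10: about one error per second.
print(seconds_per_error(1e10, 1e-10))

# The same link at 1E-12: about one error every 100 seconds.
print(seconds_per_error(1e10, 1e-12))

# Rule-of-thumb targets for 100 Mb/s and 10 Gb/s links.
print(target_ber(1e8), target_ber(1e10))
```

With a margin of 100, the rule of thumb amounts to roughly one raw bit error per 100 seconds of traffic, whatever the line rate.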