[Beowulf] Opinions of Hyper-threading?
james.p.lux at jpl.nasa.gov
Thu Feb 28 08:17:04 PST 2008
Quoting Mark Hahn <hahn at mcmaster.ca>, on Thu 28 Feb 2008 07:33:07 AM PST:
>>>>> The problem with many (cores|threads) is that memory bandwidth
>>>>> wall. A fixed size (B) pipe to memory, with N requesters on
>>>>> that pipe ...
>>>> What wall? Bandwidth is easy, it just costs money, and not much
>>>> at that. Want 50GB/sec buy a $170 video card. Want
>>>> 100GB/sec... buy a
>>> Heh... if it were that easy, we would spend extra on more bandwidth for
>>> Harpertown and Barcelona ...
> I think the point is that chip vendors are not talking about mere
> doubling of number of cores, but (apparently with straight faces),
> things like 1k GP cores/chip.
> personally, I think they're in for a surprise - that there isn't a vast
> market for more than 2-4 cores per chip.
Perhaps not today. But then, Thomas Watson said there wasn't a vast
market for computers.. perhaps 5 world wide.
No question that folks will have to figure out how to effectively use
all that parallelism. (e.g. each processor deals with one page of a
Word document, or a range of Excel cells?). I can see a lot of fairly
easily coded things dealing with rapid search (e.g. which of my
documents have the word hyperthreading and Hahn in them). Right now,
search and retrieval of unstructured data is a very computationally
intensive task that millions of folks suffer through daily. (How many
of you find Google over the web faster than Microsoft's "Search for
File or Folder.." (or grepping the entire disk) on your local machine?)
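As a rough illustration of the "which of my documents have these words" case, here is a minimal Python sketch that splits a document collection into shards and lets each worker scan its own shard. All the names (`matches`, `search_shard`, `parallel_search`) and the sample documents are made up for the example, and a thread pool merely stands in for the many cores; a real CPU-bound search would want separate processes or cores per shard.

```python
from concurrent.futures import ThreadPoolExecutor


def matches(text, words):
    """Return True if every query word occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return all(w.lower() in lowered for w in words)


def search_shard(shard, words):
    """Scan one shard of (name, text) pairs and return the matching names."""
    return [name for name, text in shard if matches(text, words)]


def parallel_search(docs, words, workers=4):
    """Split docs into `workers` shards and search the shards concurrently."""
    items = sorted(docs.items())
    # Round-robin sharding: worker i gets items i, i + workers, ...
    shards = [items[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(search_shard, shards, [words] * workers)
    return sorted(name for hits in results for name in hits)


# Hypothetical sample data, not from any real disk.
docs = {
    "memo.txt": "Hahn wrote about hyperthreading on the Beowulf list.",
    "notes.txt": "Memory bandwidth limits scaling.",
    "reply.txt": "Hyperthreading helps some codes, says Hahn.",
}
print(parallel_search(docs, ["hyperthreading", "Hahn"]))
```

The shards share nothing, so the only serial steps are the split and the final merge, which is why this kind of workload is a plausible candidate for lots of cores.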
And we cluster dweebs have a headstart on them... we've been dealing
with figuring out how to spread problems that are too big to fit on
one node across multiples for years now. After all, billg's
programming fame is from a flood fill graphics algorithm, and look how
well he's done with that <grin>.
>>> limits, and no programming technique is going to get you around that
>>> limit per socket. You need to change your programming technique to go
>>> many socket. That limit is the bandwidth wall.
> IMO, this is the main fallacy behind the current industry harangue.
> the problem is _NOT_ that programmers are dragging their feet, but
> rather some combination of amdahl's law and the low average _inherent_
> parallelism of computation. (I'm _not_ talking about MC or graphics
> rendering here, but today's most common computer uses: web and email.)
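For reference, the Amdahl's law limit mentioned above can be sketched in a few lines of Python. The 90% parallel fraction is purely an assumed number for illustration, not a measurement of any real workload, and `amdahl_speedup` is a made-up helper name.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup = 1 / ((1 - p) + p / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed: 90% of the work parallelizes perfectly, 10% stays serial.
for cores in (2, 4, 128, 1000):
    print(f"{cores:5d} cores -> speedup {amdahl_speedup(0.9, cores):.2f}x")
```

With any nonzero serial fraction the speedup is capped at 1 / (serial fraction), no matter how many cores are added.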
Text search and retrieval is where it's at. Almost 30 years ago I
worked on developing a piece of office equipment the size of a
two-drawer file cabinet that would do just that, hooked up to a bunch
of word processors (i.e. find me that letter we sent to John Smith).
It was expensive! It had an 80MB (or 160MB) disk drive (huge!), and it
could search thousands of pages in the blink of an eye. (It was called
the OFISfile, sold by Burroughs.) And people DID buy it. And, without
giving away the internals, it could have made excellent use of a
1000-core type processor.
Granted, the googles of the world will (correctly) contend that an
equally good solution is to have a good comm link to a centralized
search and retrieval engine (doesn't even have to be that fast.. just
comparable to the time it takes me to enter the request and read the
results). But, they too can use parallelism.
> the manycore cart is being put before the horse. worse, no one has really
> shown that manycore (and the presumed ccnuma model) is actually
> scalable to large values on "normal" workloads. (getting good scaling
> for an AM CFD code on 128 cores in an Altix is kind of a different
> proposition than scaling to 128 cores in a single chip.)
To a certain extent it's an example of build it and they will come
(to 10% of the things that are built, the other 90% are interesting
blips left by the side of the road).
When compilers were introduced, I'm sure the skilled machine language
coders said.. hmmph, we can do just fine with our octal and hex,
there's no expressed demand for high level languages. (Kids..get offa
my lawn!) Heck, the plugboard programmers on EAM equipment probably
said that to the guys working with stored program computers. And
before that, the supervisor of the computer pool probably said that
to the plugboard guys, as he gazed over a room full of Marchant
calculators with (human) computers punching numbers and pulling the handles.
> what's missing is a reason to think that basically all workloads can be made
> cache-friendly enough to scale to 10e2 or 10e3 cores. I just don't see that.
Not all workloads... just enough so that it forms a significant
market. And text search and retrieval is a pretty big consumer of CPU
cycles in the big wide world (as opposed to the specialized world of
large numeric simulations and the like that have historically been
hosted on clusters).
Remember, the recurring cost is basically related to the size of the
die, not what's on it. So, if there's a significant market for 10,000
processor widgets, they'll be made, and cheaply.
>> As data rates get higher, even really good bit error rates on the
>> wire get to be too big. Consider this.. a BER of 1E-10 is quite
>> good, but if you're pumping 10Gb/s over the wire, that's an error
>> every second. (A BER of 1E-10 is a typical rate for something like
>> 100Mbps link...). So, practical systems
> I'm no expert, but 1e-10 seems quite high to me. the docs I found about 10G
> requirements all specified 1e-12, and claimed to have achieved 1e-15 in
> realistic, long-range tests...
That's probably the error rate above the PHY layer. I.e. after the
forward error correction. And the 10G requirement is tighter than the
100Mbps requirement, just to make FEC possible with reasonable
redundancy. Typically, you want a raw PHY BER at least 100x below the
reciprocal of the bit rate, i.e. no more than about one raw error per
100 seconds (e.g. 1E8 bps->1E-10 BER, 1E10 bps->1E-12 BER).
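The arithmetic behind those numbers is just bit rate times BER. A small Python sketch, with made-up helper names and the 100x margin taken as the rule of thumb stated above (an assumption, not a figure from any standard):

```python
def errors_per_second(bit_rate_bps, ber):
    """Expected raw bit errors per second on the wire."""
    return bit_rate_bps * ber


def seconds_between_errors(bit_rate_bps, ber):
    """Mean time between raw bit errors, in seconds."""
    return 1.0 / errors_per_second(bit_rate_bps, ber)


def target_ber(bit_rate_bps, margin=100):
    """Rule-of-thumb raw BER: at most one error per `margin` seconds."""
    return 1.0 / (margin * bit_rate_bps)


# Link speeds from the discussion: 100 Mbps and 10 Gbps.
for rate in (1e8, 1e10):
    print(f"{rate:.0e} bps: 1E-10 BER gives "
          f"{errors_per_second(rate, 1e-10):.3g} errors/s, "
          f"rule-of-thumb target BER {target_ber(rate):.0e}")
```

At 10Gb/s a 1E-10 BER works out to roughly one error per second, while the same BER at 100Mbps is about one error every 100 seconds.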