[Beowulf] Opinions of Hyper-threading?

Wed Feb 27 11:45:51 PST 2008

>> imagine if, instead of 8 cores onchip, you just had 8 "thread sequence"
>> units that contained fetch/decode, architected registers and retirement.
>> and a single big pool of scoreboarded functional units, of course.  the 
>> advantage being that one thread could use many units.  as opposed to a 
>> static 8-core where each thread gets only the unit(s) in its core...
>
> Hi Mark,
>
> Let's calculate with your imaginary chip where you get rid of the multicore 
> thought and have to get rid of out of order in order to get your thread 
> sequence idea to work:
>
> If you've got 8 threads that execute each 1 instruction a cycle,
> that's:
>
> 8 * 1 * 3Ghz = 24 Gflop double precision

no.  I did not say that each thread unit was scalar (single-issue).

> 4 cores * 3 instructions a cycle * 2 DP in each SSE2/SSE3 vector * 2.4Ghz = 
> 24 * 2.4 = 48 + 9.6 = 57.6

"my" 8-thread-unit chip would also have superscalar dispatch, and would 
obviously also have SIMD units.  as far as I can tell, you missed 
the WHOLE point of my digression: that the current manycore trend is a form
of static partitioning.
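
to make the partitioning point concrete, here is a toy sketch (python, 
not a model of any real chip).  the per-cycle demand numbers and the 
"2 units per core" figure are invented for illustration; the only thing
it shows is that a shared pool can never do worse than a static split of
the same units, and does better whenever demand is uneven.

```python
def static_throughput(demands, units_per_core):
    """Units kept busy when each thread may only use its own core's units."""
    return sum(min(d, units_per_core) for d in demands)


def pooled_throughput(demands, units_per_core):
    """Units kept busy when the same units form one shared pool."""
    total_units = units_per_core * len(demands)
    return min(sum(demands), total_units)


# Hypothetical demand: how many functional units each of 8 threads
# could use in one cycle. These numbers are made up.
demands = [6, 0, 1, 0, 5, 2, 0, 2]

print("static:", static_throughput(demands, 2))
print("pooled:", pooled_throughput(demands, 2))
```

this ignores scheduling, wiring and latency costs of a big shared pool, 
which are exactly the hard parts.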

such chopping up into little pieces leads to less efficiency,
since it's hard to ensure your workload is always perfectly balanced.
such balance becomes exponentially less likely as the core count explodes.
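
a rough way to see the balance problem: if every core must wait for the
slowest one, the useful fraction of the machine is mean load / max load.
the sketch below assumes (purely for illustration) that per-core work is
independent and uniform between 0.5 and 1.5; real workloads will differ,
but the trend with core count is the point.

```python
import random


def static_efficiency(loads):
    """Fraction of core-time doing useful work when all wait for the slowest."""
    return sum(loads) / (len(loads) * max(loads))


def mean_efficiency(cores, trials=2000, seed=0):
    """Average efficiency over random load vectors (assumed distribution)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumption: independent per-core work, uniform in [0.5, 1.5].
        loads = [rng.uniform(0.5, 1.5) for _ in range(cores)]
        total += static_efficiency(loads)
    return total / trials


for cores in (2, 8, 32, 128):
    print(cores, round(mean_efficiency(cores), 3))
```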

> Multicore and out of order are big winners that butcher RISC and the old 
> Alpha engineers SMT idea completely,
> with exception of power usage.

don't be so naive - all current processors owe most of what they are to 
previous architectures (_especially_ trends like RISC, OOO, Alpha and SMT).

> Multicore right now means BOOM you are factor 4.0 faster nearly (3.8 in case 
> of my chessproggie), and out of order means you have
> a potential of 3 to 4 instructions a cycle which is a big winner too.

you're confusing superscalar with OOO here.  but again, that's not the point.
there's nothing wrong with the manycore trend, just that it's kind of dumb - 
enough to make me think chip architects who cut their teeth on RISC are now 
looking forward to retirement rather than pushing for excellent designs ;)

> Replacing that with some other technique SMT means the other technique SMT 
> needs to find a factor 12 in speed somewhere.

again, replication of cores is trivial, architecturally, and soaks up the 
extra transistors.  my question is: are there improved or better ways?

observe, for instance, that your code is clearly cache-happy.  good for you!
not all workloads are, and because offchip memory interfaces are not
keeping pace with Moore's law, memory is a real problem that manycore only 
exacerbates.
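
the arithmetic behind that, as a sketch: hold the socket's memory 
bandwidth fixed and keep adding cores.  the 10 GB/s figure is a 
placeholder, not a measurement; the 14.4 Gflop/core figure is just the 
quoted 3 instructions * 2 DP * 2.4 GHz.

```python
def bandwidth_per_core(socket_gb_s, cores):
    """Off-chip bandwidth each core gets if the socket total is shared evenly."""
    return socket_gb_s / cores


def bytes_per_flop(socket_gb_s, cores, gflops_per_core):
    """Bytes of off-chip traffic available per peak floating-point operation."""
    return socket_gb_s / (cores * gflops_per_core)


SOCKET_GB_S = 10.0       # hypothetical, fixed off-chip bandwidth
GFLOPS_PER_CORE = 14.4   # 3 instr/cycle * 2 DP * 2.4 GHz, from the quoted post

for cores in (1, 2, 4, 8, 16):
    print(cores,
          round(bandwidth_per_core(SOCKET_GB_S, cores), 3),
          round(bytes_per_flop(SOCKET_GB_S, cores, GFLOPS_PER_CORE), 4))
```

a cache-happy code never notices the shrinking bytes/flop column; a 
streaming code lives or dies by it.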