[Beowulf] Opinions of Hyper-threading?

Wed Feb 27 15:18:49 PST 2008

-------------- Original message -------------- 
From: Ashley Pittman <apittman at concurrent-thinking.com> 

> I saw a talk which said SMT was worth a maximum of 20% on power5 and 
> often performed worse than if it had been turned off. This correlates 
> well with my experience of it on Intel CPUs. 

As Joe Landman suggested, the notion of a thread (as a logical construct representing parallelizable work) can be reduced to a single instruction. In this case, the logical distance between the two workloads is minimal and is managed by OoO hardware (or VLIW). As the separation of the parallel workloads (threads) grows, we move to parallel threads within one code that are defined by code blocks, then to workloads in different processes within the same MPI application, then to workloads separated by an even greater logical distance in different applications, and finally to thread groups virtualized across OS environments.
As Mark H. points out, the functional units do not care who is parent to their work. Still, the problem of shepherding each result back to its proper pen grows with the distance of logical separation. Hardware resources and chip surface area are required to manage this. That is one reason why Intel delayed SMT in Clovertown and Harpertown, and why many-core advocates think that threads are a waste of chip space, especially in a data-parallel universe.
We wish to more fully utilize underutilized functional-unit resources in a core, of course, but as Ashley P. intimated, as we pile threads disproportionately on top of ever-growing parallel hardware, the chance of a delaying collision grows in a non-linear way. Thus the expanded variance. Put another way, the gap between trivial or random schedules through the hardware and optimal ones grows (like the distance between a pair and a straight flush in poker) as both workloads and resources grow. If we are allocating resources based on sampling, we then run into the problem of not being able to discover where the idle resources are. This is visible in scaling tests of very fat server nodes using the VMmark benchmark: efficiency drops off even with weakly scaled benchmarks.
Are there better alternatives? Well, at the instruction level, we have VLIW: prepacked workloads known not to interfere with each other. What about at the level of schedulers, which as I understand it are all sampling based? There is the notion of resource-requirement-aware scheduling, which intends to eliminate resource collisions in advance for virtualized workloads. The Cray XMT uses hardware resources to insulate large groups of parallel workloads (sometimes at the expense of individual or related ones) from interference that might idle usable resources if additional, more or less distant, work were not available.
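To put a rough number on the non-linear growth in collision chance, here is a toy model in Python. It is only a sketch under a strong assumption that is mine and not a property of any real core: each thread independently requests one of `units` identical functional units, chosen uniformly at random, in a given cycle. The function name and parameters are hypothetical.

```python
from math import prod


def collision_probability(threads: int, units: int) -> float:
    """Chance that at least two of `threads` requests land on the same unit.

    Assumption (illustrative only): every request picks one of `units`
    functional units uniformly and independently, birthday-problem style.
    """
    if threads > units:
        # More requests than units: some unit must be shared.
        return 1.0
    # Probability that every request lands on a distinct unit.
    no_collision = prod((units - i) / units for i in range(threads))
    return 1.0 - no_collision


if __name__ == "__main__":
    # Doubling the thread count much more than doubles the collision chance.
    for t in (1, 2, 4, 8):
        print(t, round(collision_probability(t, units=8), 3))
```

With 8 units, going from 2 to 4 threads raises the collision chance from 0.125 to about 0.59 in this model, which is the kind of faster-than-linear growth I have in mind. Real hardware is not uniform or independent, so treat the numbers as illustrative only.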
This discussion invokes wild thoughts ... like the notion of compiling multiple applications together in a cluster ... and running them together, knowing that the compiler can shuffle the work together smartly without the need for additional hardware resources to do it.
Are folks here familiar with eXludus' resource-requirement-aware scheduling technology?
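For anyone unfamiliar with the general idea, here is a toy Python sketch of what placing work by declared resource requirements could look like. To be clear, this is not eXludus' method (I do not know their implementation); it is a plain greedy first-fit over per-resource demands, and the resource names (`fpu`, `membw`), job names, and the `place` and `fits` helpers are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical demand vector: resource name -> fraction of one core's capacity.
Demand = Dict[str, float]


def fits(core_load: Demand, job: Demand, capacity: float = 1.0) -> bool:
    """True if adding `job` keeps every resource on the core within capacity."""
    return all(core_load.get(r, 0.0) + need <= capacity for r, need in job.items())


def place(jobs: Dict[str, Demand], n_cores: int) -> List[List[str]]:
    """Greedy first-fit: co-locate jobs only when their declared demands
    do not oversubscribe any single resource on a core."""
    loads: List[Demand] = [dict() for _ in range(n_cores)]
    placement: List[List[str]] = [[] for _ in range(n_cores)]
    # Handle the most demanding jobs first (by their largest single demand).
    for name, demand in sorted(jobs.items(), key=lambda kv: -max(kv[1].values())):
        for load, slot in zip(loads, placement):
            if fits(load, demand):
                for r, need in demand.items():
                    load[r] = load.get(r, 0.0) + need
                slot.append(name)
                break
        else:
            raise ValueError(f"no core can host {name} without oversubscription")
    return placement


if __name__ == "__main__":
    jobs = {
        "fp_heavy": {"fpu": 0.8, "membw": 0.1},
        "mem_heavy": {"fpu": 0.1, "membw": 0.8},
        "fp_heavy2": {"fpu": 0.7, "membw": 0.2},
        "mem_heavy2": {"fpu": 0.2, "membw": 0.7},
    }
    print(place(jobs, n_cores=2))
```

The point of the sketch is that complementary jobs (one floating-point bound, one memory bound) end up sharing a core, while two floating-point-bound jobs never do. The collision is avoided in advance from the declared requirements rather than discovered afterwards by sampling.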
Sorry about the length ... but it is an interesting topic.
Regards,
rbw
-- 

"Making predictions is hard, especially about the future." 

Niels Bohr 

-- 

Richard Walsh 
Thrashing River Consulting
5605 Alameda St. 
Shoreview, MN 55126 