[Beowulf] Q: IB message rate & large core counts (per node) ?

Fri Feb 26 10:20:49 PST 2010

On Feb 26, 2010, at 12:36 PM, richard.walsh at comcast.net wrote:

> 
> Mark Hahn wrote:
> 
> >> Doesn't this assume worst case all-to-all type communication
> >> patterns.
> >
> >I'm assuming random point-to-point communication, actually.
> 
> A sub-case of all-to-all (possibly all-to-all). So you are assuming
> random point-to-point is a common pattern in HPC ... mmm ... I
> would call it a worst-case pattern, something more typical of
> graph searching codes like they run at the NSA.  Sure a high
> radix switch (or better yet a global memory address space, Cray
> X1E) is good and designed for this worst-case, but not sure this
> is the common case data reference pattern in HPC ... if it were
> they would be selling more global memory systems at Cray and
> SGI (not just to the NSA).

Designing the communications network for this worst-case pattern has a
number of benefits:  

* it makes the machine less sensitive to the actual communications pattern
* it makes performance less variable run-to-run, when the job controller
chooses different subsets of the system
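
A rough sketch of the second point, on a topology that is sensitive to where
the job lands: it compares the mean pairwise hop count of a compact block of
nodes against a scattered allocation of the same size on a 3D torus. The
16x16x16 machine size, the 64-node job, and the function names are all
assumptions made up for illustration, and hop count is only a crude proxy for
delivered performance.

```python
import itertools
import random

def torus_hops(a, b, dims):
    """Minimum hop count between coordinates a and b on a torus (with wraparound)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def mean_pairwise_hops(nodes, dims):
    """Average hop count over all unordered pairs of allocated nodes."""
    pairs = list(itertools.combinations(nodes, 2))
    return sum(torus_hops(a, b, dims) for a, b in pairs) / len(pairs)

DIMS = (16, 16, 16)  # hypothetical torus dimensions
all_nodes = list(itertools.product(*(range(d) for d in DIMS)))

# Allocation 1: a compact 4x4x4 block in one corner of the machine.
compact = list(itertools.product(range(4), repeat=3))

# Allocation 2: the same number of nodes picked at random by the "job controller".
scattered = random.Random(1).sample(all_nodes, len(compact))

compact_hops = mean_pairwise_hops(compact, DIMS)
scattered_hops = mean_pairwise_hops(scattered, DIMS)
print(f"compact block:   {compact_hops:.2f} mean hops")
print(f"scattered nodes: {scattered_hops:.2f} mean hops")
```

The same job sees a much longer average path when its nodes are scattered,
which is the run-to-run variability that a network sized for random
point-to-point traffic is meant to reduce.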

> 
> There you might also want a machine like the Cray XMT where
> the memory is flat and stalled threads can be switched out for
> another thread.  
> 
> >> If you are just trading ghost cell data with your neighbors
> >> and you have placed your job smartly on the torus the fan out
> >> advantage mentioned is irrelevant. No?

Smart placement is a lot harder than it appears:
* The actual communications pattern often doesn't match preconceptions.
* Communications from concurrently running applications can interfere.

There's a paper in the IBM Journal of Research and Development about this.
They wound up using simulated annealing to find a good placement on the most
regular machine around, because the "obvious" assignments weren't optimal.
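
For readers who haven't seen the technique, here is a minimal sketch of
simulated annealing applied to rank placement on a torus. It is not the method
from the paper mentioned above: the cost function (traffic volume times hop
count), the pair-swap move, the linear cooling schedule, and every name and
parameter are assumptions chosen for illustration.

```python
import itertools
import math
import random

def torus_hops(a, b, dims):
    """Minimum hop count between coordinates a and b on a torus (with wraparound)."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def placement_cost(placement, traffic, dims):
    """Hypothetical cost: traffic volume weighted by hop distance, summed over pairs."""
    return sum(vol * torus_hops(placement[i], placement[j], dims)
               for (i, j), vol in traffic.items())

def anneal_placement(traffic, n_ranks, dims, steps=20000, t0=2.0, seed=0):
    """Search for a low-cost rank-to-node mapping by swapping pairs of ranks."""
    rng = random.Random(seed)
    nodes = list(itertools.product(*(range(d) for d in dims)))
    rng.shuffle(nodes)
    placement = nodes[:n_ranks]  # placement[rank] = node coordinate
    cost = placement_cost(placement, traffic, dims)
    best, best_cost = list(placement), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling, never exactly zero
        i, j = rng.sample(range(n_ranks), 2)
        placement[i], placement[j] = placement[j], placement[i]
        new_cost = placement_cost(placement, traffic, dims)
        # Always accept improvements; accept regressions with Boltzmann probability.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(placement), cost
        else:
            placement[i], placement[j] = placement[j], placement[i]  # undo the swap
    return best, best_cost

def ring_traffic(n):
    """Toy traffic matrix: each rank sends one unit to its successor in a ring."""
    return {(r, (r + 1) % n): 1.0 for r in range(n)}

if __name__ == "__main__":
    dims = (4, 4)  # hypothetical 4x4 torus
    best, best_cost = anneal_placement(ring_traffic(16), 16, dims)
    print("best cost found:", best_cost)
```

A real placement tool would use a measured traffic matrix and account for link
contention rather than hop count alone, which is part of why the "obvious"
assignments can lose.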

...

In addition to this stuff, the quality of the interconnect has other effects:

* a fast, low-latency interconnect lets the application scale effectively to larger
numbers of nodes before performance rolls off
* an interconnect with low-latency short messages provides a decent base for
PGAS languages like UPC and Co-Array Fortran, or for lightweight communications
APIs like SHMEM or active messages

Personally, I believe our thinking about interconnects has been poisoned by the
idea that NICs are I/O devices.  We would be better off if they were
coprocessors.  Threads should be able to send messages by writing to registers,
and arriving packets should activate a hyperthread that has full core
capabilities for acting on them, along with the ability to interact coherently
with the memory hierarchy from the same end as other processors.  We had
started kicking this around for the SiCortex gen-3 chip, but were overtaken by
events.

-Larry
