Myrinet scalability

Patrick Geoffray patrick at myri.com
Thu Jun 20 00:52:38 PDT 2002


Hi Serguei, Ole,

Serguei Patchkovskii wrote:
> ----- Original Message ----- 
> "Ole W. Saastad" <ole at scali.com> wrote:
> 
>>with this talk about scalability and switches I would like to
>>point out that the SCI interconnect uses no switch.
>>The only thing you need is to add extra compute nodes
>>and recable the cluster. The cost increases linearly with
>>the number of nodes. There are no step costs when you must buy
>>more switch ports. 
> 
> 
> While this sounds more attractive than Myrinet in theory, the practice
> may (or may not) turn out to be a little bit different: The number of 


I took the initiative to forward this interesting thread to my boss,
Chuck Seitz, CEO of Myricom. I know he likes this topic a lot, mainly
because he was already working on it 10 years ago at Caltech. Here is
his input. I have "sanitized" it, removing NDA-covered information and
non-scientific comments :-)
I think it's an interesting contribution to this thread.
---

The respondent, Serguei, is arguing correctly but too narrowly.  The
Scali person claims linear-cost scaling, but with a network of
constant total capacity (i.e., constant bisection).  It's pretty easy to achieve
linear cost that way ;-), but he ignores 50 years of research and
experience in concurrent computing and networks.  Myrinet Clos networks
scale up the network capacity with the number of nodes (full-bisection
Clos networks), a different and far more desirable form of scaling.

Serguei points out the defect in the first person's argument by noting
that 6-10 is a practical limit on the radix (k) of a k-ary n-cube
(dimension n) routing network* for his applications.  Thus, one hits
discontinuities at 6, 36, 216, ... (or 10, 100, 1000, ...) hosts.  Note
that at the "36" discontinuity, the network cost per host doubles; at
the 216 discontinuity, the network cost per host increases by 3/2, ...
Thus, Serguei forces the SCI network into scaling the network capacity
(bisection) with the number of hosts.  Of course, the discontinuities in
this scaling are quite early and nasty.
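
[Illustrative aside: a rough back-of-the-envelope sketch of those
steps, assuming the per-host network cost of a k-ary n-cube is roughly
proportional to the dimension n (each added dimension adds another set
of links and switch ports per node):

    # Hypothetical sketch: dimension (and hence relative cost per host)
    # of a k-ary n-cube large enough to hold a given number of hosts.
    def kary_ncube_dimension(num_hosts, k=6):
        """Smallest n such that k**n >= num_hosts."""
        n = 1
        while k ** n < num_hosts:
            n += 1
        return n

    for hosts in (6, 7, 36, 37, 216, 217):
        n = kary_ncube_dimension(hosts, k=6)
        # Relative cost per host taken as ~ n; the unit is arbitrary.
        print(f"{hosts:3d} hosts -> dimension {n}, cost/host ~ {n}x")

Going from 6 to 7 hosts doubles the per-host cost (n: 1 -> 2), and
going from 36 to 37 raises it by 3/2 (n: 2 -> 3), which matches the
doubling and the 3/2 step described above.]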

[* For a description of this topology and its relationship to other
topologies, see: Charles L. Seitz, "Concurrent VLSI Architectures,"
_IEEE Transactions on Computers_ C-33(12): 1247--1265, invited paper for
the issue celebrating the centennial of the founding of the IEEE,
December 1984.]

Myrinet networks also have discontinuities in the network cost per host.
All optimal networks in which the bisection scales with the number of
hosts (N) have an N*log(N) cost asymptotically in abstract metrics.
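
[Illustrative aside: one quick way to see the N*log(N) figure, under
the usual folded-Clos/fat-tree assumptions rather than anything
product-specific: with radix-r switches, reaching N hosts at full
bisection takes about L = log_{r/2}(N) switch stages, each stage
carries on the order of N ports, and so the total port count is roughly
N*(2L - 1), i.e., Theta(N*log N).]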

Our engineering can reasonably be judged by how well we
handle the inevitable discontinuities in the real (non-asymptotic)
world, but one should neglect the discontinuities that occur because
Myrinet (for packaging and chip economy) has 8 host ports per line card.
The significant discontinuities occur because the Clos network (a
particular full-bisection network whose cost is asymptotically N*log(N))
of 16-port switches (a nearly optimal point today for technology
reasons) is of different diameter for different numbers of hosts:

        N          diameter (# of switches traversed)

        2          0

    8, 16          1  \  cost (list prices) in this range
   24-128          3  /  averages ~$400 per host port

 129-1024          5  \  cost (list prices) in this range
1025-8192          7  /  averages ~$1,200 per host port
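
[Illustrative aside: a small sketch that reproduces the diameters in
the table above, assuming the standard folded-Clos construction from
16-port switches, where each leaf switch splits its ports 8 down (to
hosts) and 8 up, so that L stages support up to 16 * 8**(L-1) hosts at
diameter 2*L - 1:

    # Hypothetical sketch: diameter of a full-bisection Clos of
    # radix-`switch_ports` switches for a given number of hosts.
    def clos_diameter(num_hosts, switch_ports=16):
        if num_hosts <= 2:
            return 0  # two hosts can be cabled back to back
        half = switch_ports // 2
        stages, capacity = 1, switch_ports
        while capacity < num_hosts:
            stages += 1
            capacity *= half
        return 2 * stages - 1

    for n in (2, 16, 128, 129, 1024, 1025, 8192):
        print(f"N = {n:4d} -> diameter {clos_diameter(n)}")

This prints diameters 0, 1, 3, 5, 5, 7, 7 for those sample values,
matching the table.]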

The sharp discontinuity (3x in the network cost per host) between the
3-128 range and the 129-8192 range can actually be smoothed by using
diameter-4, almost-full-bisection networks for, for example, N=192.  The
discontinuity is also smoothed in practice by discounts for large (N >
128) installations.

However, what is most remarkable is that "linear-cost" scaling (constant
cost per host) is maintained over these two broad ranges while providing
full bisection (a network capacity that scales with the number of
hosts).  The underlying reason for so large a discontinuity between
diameter 3 and diameter 5 is that for N <= 128, we can keep the entire
network inside a box, whereas for 128 < N <= 8192, one of the Clos
"spreader" networks must be implemented with fiber cables.  It's really
the external (fiber) ports, not the switching, that are expensive.

If Myricom were building Clos networks of 32-port switches, the major
(diameter 3 -> 5) discontinuity would occur at 512 hosts.  Driving the
port cost down will also help the costs in both ranges.
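
[Illustrative aside: with the same back-of-the-envelope construction as
the sketch above, a 32-port leaf switch has 16 host ports and 16
uplinks, so a diameter-3 Clos tops out at 32 * 16 = 512 hosts; e.g.,
clos_diameter(512, switch_ports=32) gives 3, while
clos_diameter(513, switch_ports=32) gives 5.]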

One factor that you should not miss is that more and more of the market 
for cluster interconnect will be for *high-availability* applications. 
The k-ary n-cube is arguably the worst topology for HA, particularly for 
small N and n.  The Clos network is arguably the best topology for HA 
(due to the multiplicity of paths between hosts).

---

Hope it helps.

Patrick

----------------------------------------------------------
|   Patrick Geoffray, Ph.D.      patrick at myri.com
|   Myricom, Inc.                http://www.myri.com
|   Cell:  865-389-8852          685 Emory Valley Rd (B)
|   Phone: 865-425-0978          Oak Ridge, TN 37830
----------------------------------------------------------



