[Beowulf] Infiniband modular switches
Don Holmgren
djholm at fnal.gov
Fri Jun 13 12:05:22 PDT 2008
On Fri, 13 Jun 2008, Ramiro Alba Queipo wrote:
> On Thu, 2008-06-12 at 10:08 -0500, Don Holmgren wrote:
>> Ramiro -
>>
>> You might want to also consider buying just a single 24-port switch for your 22
>> nodes, and then when you expand either replace with a larger switch, or build a
>> distributed switch fabric with a number of leaf switches connecting into a
>> central spine switch (or switches). By the time you expand to the larger
>> cluster, switches based on the announced 36-port Mellanox crossbar silicon will
>> be available and perhaps per port prices will have dropped sufficiently to
>> justify the purchase delay and the disruption at the time of expansion.
>
> Could you explain this solution to me? I did not know about it.
As far as I know, all currently available commercial Infiniband switches are
based on the Mellanox 24-port non-blocking silicon switch chip (InfiniScale
III). The 96-, 144-, and 288-port modular switches from the various companies
use a number of these individual chips in a layered (3-hop) design that
provides full bisection bandwidth. One can also construct a full bisection
bandwidth 144-port (say) fabric out of standalone 24-port switches: twelve
leaf switches each connect 12 ports to nodes (144 node ports in all) and 12
ports to uplinks, and those 144 uplinks land on six 24-port spine switches,
for eighteen switches in total. The latency should be identical to that of a
144-port chassis, as both use three switch hops (disregarding the negligible
nanosecond or so per foot of extra cable length delay when using discrete
24-port switches).
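To put rough numbers on that, here is a quick Python sketch of the port
arithmetic (the function and its name are mine, for illustration only, not
any vendor's configuration tool):

def size_fat_tree(nodes, radix):
    """Size a two-level, full-bisection fat tree built from fixed-radix
    crossbar switches (e.g. the 24-port InfiniScale III silicon).  Each
    leaf switch devotes half its ports to nodes, half to spine uplinks."""
    down_per_leaf = radix // 2                            # node-facing ports per leaf
    leaves = (nodes + down_per_leaf - 1) // down_per_leaf # ceiling division
    uplinks = leaves * (radix - down_per_leaf)            # leaf-to-spine cables
    spines = (uplinks + radix - 1) // radix               # ceiling division
    return leaves, spines, leaves + spines, uplinks

leaves, spines, total, uplinks = size_fat_tree(144, 24)
print("%d leaves + %d spines = %d switches, 144 node cables, %d uplink cables"
      % (leaves, spines, total, uplinks))
# -> 12 leaves + 6 spines = 18 switches, 144 node cables, 144 uplink cables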
Usually the per-port cost for a large switch is less than the per-port cost
for a bunch of 24-port switches. When you don't need a full 144-port switch,
you can either buy the large chassis populated with only some of its leaf
blades, or go with a set of 24-port switches. For smaller networks a set of
24-port switches is cheaper.
The next-generation switch silicon (InfiniScale IV) will have 36 ports rather
than 24. Obviously I can't predict for certain that the large switches built
from this silicon will be cheaper than the current models, but it is
reasonable to guess that they will be.
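For a rough sense of why the 36-port parts matter: the usual folded-Clos
limits say a full bisection fabric built from radix-r switch chips tops out
at r*r/2 end ports with two switch levels and r**3/4 with three (these are
standard bounds, not anything from the product announcements):

# Maximum end ports of a full-bisection fat tree built from radix-r chips.
for radix in (24, 36):
    print("%d-port silicon: 2-level max = %d ports, 3-level max = %d ports"
          % (radix, radix * radix // 2, radix ** 3 // 4))
# -> 24-port silicon: 2-level max = 288 ports, 3-level max = 3456 ports
# -> 36-port silicon: 2-level max = 648 ports, 3-level max = 11664 ports

The 288-port figure is exactly where today's largest single chassis sits, so
the bigger chip should allow either larger fabrics or the same sizes built
from fewer chips.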
>
>>
>> If your applications can tolerate some oversubscription (less than a 1:1 ratio
>> of leaf-to-spine uplinks to leaf-to-node connections), a distributed switch
>> fabric (leaf and spine) has the advantage of shorter (and cheaper) cables
>> between the leaf switches and your nodes, and relatively fewer longer cables
>> from the leaves back to the spine, compared with a single central switch.
>
>
> What do you mean with a distributed switch fabric?
> What is the difference with a modular solution?
>
> Thanks for your answer
>
> Regards
I think both of these questions are answered above. But to be clear, by
"distributed" I mean that instead of one large switch chassis, one would use
a number of 24-port switches. In this case it is very natural to put the
individual switches next to their nodes. See, for example, the "A New Approach
to Clustering - Distributed Federated Switches" white paper at the Mellanox
web site. When the switches are next to the nodes, the cable plant can be a
lot easier to deal with. Don't underestimate the pain of having 144 fairly
hefty Infiniband cables all terminating in a single 10U chassis.
One additional item of note when using a distributed fabric: if your typical
jobs use a small number of nodes, then it is quite possible to configure your
batch scheduler so that the nodes belonging to an individual job all connect to
the same leaf switch. This means that your messages only have to go through one
switch hop, so latency is reduced compared with going through three hops in a
large modular switch chassis (although I seriously doubt that the quarter
microsecond of latency difference here matters to many codes). Perhaps of more
significance, though, is that you can use oversubscription to lower the cost of
your fabric. Instead of connecting 12 ports of a leaf switch to nodes and using
the other 12 ports as uplinks, you might get away with 18 nodes and 6 uplinks,
or 20 nodes and 4 uplinks. As core counts increase and more of each job's
communication stays on-node, this is becoming more and more viable for some
applications.
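As a back-of-the-envelope illustration of that trade-off (the helper below
and its name are mine; the splits are just the ones mentioned above):

def leaf_split(radix, nodes_per_leaf):
    """For a leaf switch of the given radix, report the uplink ports that
    remain and the oversubscription ratio (node ports : uplink ports)."""
    uplinks = radix - nodes_per_leaf
    return uplinks, float(nodes_per_leaf) / uplinks

for nodes_per_leaf in (12, 18, 20):
    uplinks, ratio = leaf_split(24, nodes_per_leaf)
    print("%d nodes, %d uplinks per leaf -> %g:1 oversubscription"
          % (nodes_per_leaf, uplinks, ratio))
# -> 12 nodes, 12 uplinks per leaf -> 1:1 oversubscription
# -> 18 nodes, 6 uplinks per leaf -> 3:1 oversubscription
# -> 20 nodes, 4 uplinks per leaf -> 5:1 oversubscription
# Fewer uplinks means fewer long leaf-to-spine cables and fewer spine ports,
# at the cost of shared bandwidth for any traffic that has to leave the leaf.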
Don