[Beowulf] torus versus (fat) tree topologies
Håkon Bugge
Hakon.Bugge at scali.com
Wed Nov 10 01:37:56 PST 2004
Chris,
I have a view on the topic, having delivered professional software for both
2D and 3D SCI torus topologies as well as for GbE, Myrinet, and IB centralized
switch topologies.
First, in such a discussion it is hard to separate _implementation_ from
_architecture_. I would state that an implementation of a 2D/3D torus
topology can have very short latencies. But why is that? Using SCI from
Dolphin you would observe very short latencies, but in my view this stems
from the NIC SW/HW architecture, not from the topology per se. For example,
a 64-bit, 66MHz PCI bus with a Dolphin NIC has lower latency than a modern
PCI-e NIC from Mellanox, both measured over a direct cable (i.e. a two-node
ringlet in the SCI case) using the same SW stack (Scali MPI Connect).
[However, the payload does not have to grow very large before the lack of
bandwidth becomes a hindrance to latency. 4x IB over PCI-e delivers about
600MB/s of one-way traffic at a 1k payload, which, as a side note, is faster
than the Cray XD1 despite their claim of being 2x faster than 4x IB on short
messages ;-)]
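To illustrate the point in the bracket, here is a minimal sketch in Python of
the usual time = latency + payload/bandwidth model; all latency and bandwidth
figures in it are made-up assumptions for illustration, not the measured
numbers quoted above.

# Sketch: a NIC with lower zero-byte latency but lower bandwidth loses its
# advantage once the payload grows. Numbers are illustrative assumptions.

def transfer_time_us(payload_bytes, latency_us, bandwidth_mb_s):
    # Simple model: zero-byte latency plus serialization time.
    # 1 MB/s equals 1 byte/us, so payload/bandwidth is already in us.
    return latency_us + payload_bytes / bandwidth_mb_s

nic_a = dict(latency_us=2.0, bandwidth_mb_s=300.0)  # low latency, low BW
nic_b = dict(latency_us=4.0, bandwidth_mb_s=900.0)  # higher latency and BW

for payload in (0, 256, 1024, 4096, 16384):
    ta = transfer_time_us(payload, **nic_a)
    tb = transfer_time_us(payload, **nic_b)
    print(f"{payload:6d} B: A={ta:7.2f} us  B={tb:7.2f} us  "
          f"-> {'A' if ta < tb else 'B'} wins")

# Payload size where the two models cross over (about 900 bytes here):
cross = (nic_a['latency_us'] - nic_b['latency_us']) / \
        (1.0 / nic_b['bandwidth_mb_s'] - 1.0 / nic_a['bandwidth_mb_s'])
print(f"crossover at roughly {cross:.0f} bytes")

With these example numbers the low-latency NIC wins only below roughly 1k
payload, which is the "does not have to grow very large" point above.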
However, one trend in particular seems to be making torus topologies less
attractive. Although it is true that the bi-section bandwidth scales (though
not linearly) with the size of the system, a decent bi-section bandwidth
requires the links which make up the torus to be significantly faster than
the I/O of the compute nodes. For example, consider a 1D torus (a ring). How
much total bandwidth is available to the attached nodes, assuming uniformly
distributed traffic? The answer is approximately 1.5 times the bandwidth of
the individual link segments. Given the link speed and the effective
compute-node I/O bandwidth through the actual NIC, it is simple arithmetic
to calculate how many nodes it is sensible to have in each dimension of the
torus (see the sketch below). However, my observation is that the link speed
of today's interconnects and the I/O speed of the nodes seem to be getting
closer and closer. If this is true, I would claim that torus topologies will
become less attractive over time for systems where the interconnect attaches
to the I/O bus, at least from a _bandwidth_-centric point of view.
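The "simple arithmetic" can be written down as a minimal sketch; the ~1.5x
aggregate-bandwidth factor for a ring under uniform traffic is the figure
quoted above, while the link and node I/O bandwidths are hypothetical
example values.

# How many nodes can a single ring (one torus dimension) feed before the
# ring, rather than the node NICs, becomes the bottleneck?
#   n * node_io_bw <= ring_factor * link_bw
def max_nodes_per_dimension(link_bw_mb_s, node_io_bw_mb_s, ring_factor=1.5):
    return int(ring_factor * link_bw_mb_s // node_io_bw_mb_s)

# Fast links, relatively slow node I/O: a few nodes per dimension is fine.
print(max_nodes_per_dimension(link_bw_mb_s=667, node_io_bw_mb_s=300))  # 3

# Node I/O approaching link speed: the sensible ring length collapses.
print(max_nodes_per_dimension(link_bw_mb_s=667, node_io_bw_mb_s=600))  # 1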
Other factors in deciding which of the two topologies is better suited have
to a large extent been commented on already. One issue, though, is that
on-site spare parts are fewer and less expensive for tori, but this factor
is of course most important for smaller systems, measuring
cost_of_spare_parts as a fraction of the total interconnect cost.
Fault-tolerance could also be less expensive with a torus. If one random
power supply breaks down in a torus, it must be that of a compute node, and
the impact of that is 1/Nth of the system (assuming a decent run-time system
which dynamically recalculates routes). If the power supply of a centralized
switch breaks down, you lose the whole system. Of course this can be
alleviated by (multiple) dual power supplies, etc., but the cost would
typically be higher than in the torus case. Also, an argument in favour of a
torus topology could be linear incremental growth cost: slightly exceeding
the no_of_ports available in a switch will sometimes significantly increase
the average cost per port, if full bi-section bandwidth is to be maintained.
The obvious drawback of torus topologies is cabling, assuming the torus is
implemented with two cables per dimension. You get significantly more
cables, implying longer deployment times and more complicated node
replacement (see the cable-count sketch below). In larger systems, though,
cabling for centralized switches tends to require very _long_ cables,
something you do not need with torus topologies.
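To make the cable-count argument concrete, here is a small sketch under the
assumptions stated in its comments (SCI-style unidirectional rings, a single
central switch, and inter-switch cables ignored); it is illustrative
arithmetic, not a wiring plan.

# Each torus node sits on one ring per dimension with an in- and an out-
# cable; every cable is shared between two neighbours, so the total is
# nodes * dimensions. A single central switch needs one cable per node.
def torus_cables(nodes, dimensions):
    return nodes * dimensions

def central_switch_cables(nodes):
    return nodes

for n in (16, 64, 256):
    print(f"{n:4d} nodes: 2D torus={torus_cables(n, 2):4d}  "
          f"3D torus={torus_cables(n, 3):4d}  "
          f"central switch={central_switch_cables(n):4d}")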
We have some interesting HPCC results using the same node hardware and the
same SW stack for 2D SCI, GbE, Myrinet, and IB. If you are interested, we
can probably disclose these numbers to you.
Hakon (Hakon.Bugge _ AT_ scali.com)