[Beowulf] torus versus (fat) tree topologies
Håkon Bugge
Hakon.Bugge at scali.com
Wed Nov 10 01:37:56 PST 2004
Chris,
I have a view on the topic, having delivered professional software for both
2D and 3D SCI torus topologies as well as for GbE, Myrinet, and IB centralized
switch topologies.
First, in such a discussion it is hard to separate _implementation_ from
_architecture_. I would state that an implementation of a 2D/3D torus
topology can have very short latencies. But why is that? Using SCI from
Dolphin you would observe very short latencies, but in my view this stems
from the NIC SW/HW architecture, not from the topology per se. For example,
a 64-bit, 66MHz PCI bus with a Dolphin NIC has lower latency than a modern
PCI-e NIC from Mellanox, both measured over a direct cable (i.e. a two-node
ringlet in the SCI case) using the same SW stack (Scali MPI Connect).
[However, the payload does not have to grow very large before the lack of
bandwidth becomes a hindrance to latency. 4x IB over PCI-e delivers about
600MB/s of one-way traffic at a 1k payload, which, as a side note, is faster
than the Cray XD1 despite their claim of being 2x faster than 4x IB on short
messages ;-)]
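To illustrate the point in the bracket, here is a minimal sketch in Python of
the usual time = latency + payload/bandwidth model; all latency and bandwidth
figures in it are made-up assumptions for illustration, not the measured
numbers quoted above.

# Sketch: a NIC with lower zero-byte latency but lower bandwidth loses its
# advantage once the payload grows. Numbers are illustrative assumptions.

def transfer_time_us(payload_bytes, latency_us, bandwidth_mb_s):
    # Simple model: zero-byte latency plus serialization time.
    # 1 MB/s equals 1 byte/us, so payload/bandwidth is already in us.
    return latency_us + payload_bytes / bandwidth_mb_s

nic_a = dict(latency_us=2.0, bandwidth_mb_s=300.0)  # low latency, low BW
nic_b = dict(latency_us=4.0, bandwidth_mb_s=900.0)  # higher latency and BW

for payload in (0, 256, 1024, 4096, 16384):
    ta = transfer_time_us(payload, **nic_a)
    tb = transfer_time_us(payload, **nic_b)
    print(f"{payload:6d} B: A={ta:7.2f} us  B={tb:7.2f} us  "
          f"-> {'A' if ta < tb else 'B'} wins")

# Payload size where the two models cross over (about 900 bytes here):
cross = (nic_a['latency_us'] - nic_b['latency_us']) / \
        (1.0 / nic_b['bandwidth_mb_s'] - 1.0 / nic_a['bandwidth_mb_s'])
print(f"crossover at roughly {cross:.0f} bytes")

With these example numbers the low-latency NIC wins only below roughly 1k
payload, which is the "does not have to grow very large" point above.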
However, one trend in particular seems to be making torus topologies less
attractive. Although it is true that the bi-section bandwidth scales (though
not linearly) with the size of the system, a decent bi-section bandwidth
requires the links which make up the torus to be significantly faster than
the I/O of the compute nodes. For example, consider a 1D torus (a ring). How
much total bandwidth is available to the attached nodes, assuming uniformly
distributed traffic? The answer is approximately 1.5 times the bandwidth of
the individual link segments. Given the link speed and the effective
compute-node I/O bandwidth through the actual NIC, it is simple arithmetic
to calculate how many nodes it is sensible to have in each dimension of the
torus (see the sketch below). However, my observation is that the link speed
of today's interconnects and the I/O speed of the nodes seem to be getting
closer and closer. If this is true, I would claim that torus topologies will
become less attractive over time for systems where the interconnect attaches
to the I/O bus, at least from a _bandwidth_-centric point of view.
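The "simple arithmetic" can be written down as a minimal sketch; the ~1.5x
aggregate-bandwidth factor for a ring under uniform traffic is the figure
quoted above, while the link and node I/O bandwidths are hypothetical
example values.

# How many nodes can a single ring (one torus dimension) feed before the
# ring, rather than the node NICs, becomes the bottleneck?
#   n * node_io_bw <= ring_factor * link_bw
def max_nodes_per_dimension(link_bw_mb_s, node_io_bw_mb_s, ring_factor=1.5):
    return int(ring_factor * link_bw_mb_s // node_io_bw_mb_s)

# Fast links, relatively slow node I/O: a few nodes per dimension is fine.
print(max_nodes_per_dimension(link_bw_mb_s=667, node_io_bw_mb_s=300))  # 3

# Node I/O approaching link speed: the sensible ring length collapses.
print(max_nodes_per_dimension(link_bw_mb_s=667, node_io_bw_mb_s=600))  # 1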
Other factors in deciding which of the two topologies is better suited have
to a large extent been commented on already. One issue, though, is that
on-site spare parts are fewer and less expensive for tori, but this factor
is of course most important for smaller systems, measuring
cost_of_spare_parts as a fraction of the total interconnect cost.
Fault-tolerance could also be less expensive with a torus. If one random
power supply breaks down in a torus, it must be that of a compute node, and
the impact of that is 1/Nth of the system (assuming a decent run-time system
which dynamically recalculates routes). If the power supply of a centralized
switch breaks down, you lose the whole system. Of course this can be
alleviated by (multiple) dual power supplies, etc., but the cost would
typically be higher than in the torus case. Also, an argument in favour of a
torus topology could be linear incremental growth cost: slightly exceeding
the no_of_ports available in a switch will sometimes significantly increase
the average cost per port, if full bi-section bandwidth is to be maintained.
The obvious drawback of torus topologies is cabling, assuming the torus is
implemented with two cables per dimension. You get significantly more
cables, implying longer deployment times and more complicated node
replacement (see the cable-count sketch below). In larger systems, though,
cabling for centralized switches tends to require very _long_ cables,
something you do not need with torus topologies.
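To make the cable-count argument concrete, here is a small sketch under the
assumptions stated in its comments (SCI-style unidirectional rings, a single
central switch, and inter-switch cables ignored); it is illustrative
arithmetic, not a wiring plan.

# Each torus node sits on one ring per dimension with an in- and an out-
# cable; every cable is shared between two neighbours, so the total is
# nodes * dimensions. A single central switch needs one cable per node.
def torus_cables(nodes, dimensions):
    return nodes * dimensions

def central_switch_cables(nodes):
    return nodes

for n in (16, 64, 256):
    print(f"{n:4d} nodes: 2D torus={torus_cables(n, 2):4d}  "
          f"3D torus={torus_cables(n, 3):4d}  "
          f"central switch={central_switch_cables(n):4d}")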
We have some interesting HPCC results using the same node hardware and the
same SW stack for 2D SCI, GbE, Myrinet, and IB. If you are interested, we
can probably disclose these numbers to you.
Hakon (Hakon.Bugge _ AT_ scali.com)