[Beowulf] Multirail Clusters: need comments

Mon Dec 5 16:17:47 PST 2005

At 11:33 5-12-2005 +0000, Ashley Pittman wrote:
>On Sun, 2005-12-04 at 13:13 -0500, Mark Hahn wrote:
>> so you're talking about 6 GB/s over quad-rail.  it's hard to 
>> imagine what you would do with that kind of bandwidth
>
>This is a very bold assertion...
>
>> only you can answer this.  my experience is that very few applications
>> need anything like that much bandwidth - I don't think anyone ever saturated
>> 1x quadrics on our alpha/elan3 clusters (~300 MB/s), and even on
>> opteron+myri-d, latency seems like more of an issue.
>
>Your alpha/elan3 clusters would have been quad CPU machines.
>
>> > Note: I found just one line in a CLRC Daresbury lab presentation about
>> > quadrail Quadrics on Alpha (probably QsNet-1?) Any update on QsNet2/Eagle?
>> 
>> at least several qsnet2/elan4 clusters ran with dual-rail.  there seem to be
>> lots of dual-port IB cards out there, but I have no idea how many sites are
>> using both ports.
>> 
>> as far as I can tell, dual-rail is quite a specialized thing, simply because
>> the native interconnects are pretty damn fast at 1x, and because when you
>> double the number of switch ports, you normally _more_ than double the cost.
>> this is mostly an issue when you hit certain thresholds related to the switch
>> granularity of the fabric.  128x for quadrics, for instance.  once you start
>> federating switches, you're swimming in cables.  it's often quite reasonable
>> to go with not-full-bisection fabrics at this scale, but if you're doing
>> multirail in the first place, that doesn't make sense!
>
>I can't speak for IB but with quadrics the second rail is *exactly* the
>same in terms of topology as the first one and hence the cost is double.
>not-full-bisection and federation relate to the size (number of ports)
>of the network, not the number of rails.  Each rail needs its own host
>bus to plug into however (the bus is the bottleneck, not the network) so
>you need to have the right machine in the first place which may cost
>more money.

>The one point you have missed however with multi-rail is that network
>bandwidth is per *node* whereas number of CPUs per node is for the
>large part increasing.  1Gb/s seems like a lot (or at least it did) but
>put it in a 16 CPU machine and all of a sudden you have *less* per CPU
>bandwidth than you had seven years ago in your alpha/elan3.  Couple that
>with CPUs being n times faster to boot and all of a sudden multi-rail
>is starting to look less pie-in-the-sky and more like a good idea.
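
A rough check of the per-CPU argument above, as a minimal Python sketch. The
~300 MB/s elan3 figure and the quad CPU alpha nodes come from this thread.
Whether "1Gb/s" is meant as gigabit or gigabyte is not clear from the text, so
both readings are computed; the function name is made up for illustration.

```python
def per_cpu_bandwidth_mb_s(node_bandwidth_mb_s, cpus_per_node):
    """Network bandwidth per CPU if the node's link is shared evenly."""
    return node_bandwidth_mb_s / cpus_per_node

# Figures quoted in this thread: ~300 MB/s per elan3 rail, quad CPU alpha nodes.
alpha_elan3 = per_cpu_bandwidth_mb_s(300, 4)

# "1Gb/s" in a 16 CPU machine, read two ways (assumption: the unit is ambiguous).
sixteen_way_gigabit = per_cpu_bandwidth_mb_s(1000 / 8, 16)   # 1 Gbit/s = 125 MB/s
sixteen_way_gigabyte = per_cpu_bandwidth_mb_s(1000, 16)      # 1 GByte/s = 1000 MB/s

print(alpha_elan3, sixteen_way_gigabit, sixteen_way_gigabyte)
```

Under either reading the 16 CPU node ends up with less network bandwidth per
CPU than the quad CPU alpha/elan3 node had.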

>It's true that it won't buy you latency, how could it?  Bandwidth for
>the most part however does what you would expect, it increases linearly

For many applications that get parallelized now, low latency is really
important. Not so much to ship a lot of data, but simply to start and
stop processors quickly.

To parallelize many applications, it's important to be able to do some
things a couple of hundred times a second.

After that of course follows a big bandwidth flow again, for example to do
matrix calculations or to multiply 2 big numbers.

We can definitely expect future processors to deliver a huge number of
gflops. I would not be amazed if long before 2010 we have 1 teraflop
processors in many supercomputers. I would rather expect most
'supercomputers' to be clusters that can simply do calculations at large
scale, in short having enough bandwidth from node to node.

An additional important requirement is quick synchronization during
bandwidth streaming.

With 16-32 processing cores per node or so, of whatever type, you can
expect that there are 16-32 streams, one per core.

So that means that switch latency of network cards is important too.

I do realize that not a single manufacturer on this list likes to quote that 
switch latency, as it is usually real UGLY.

However, such seemingly tiny details will become important.

Because just calculate how much data a single core of say 350 gflop can
deliver.

16 x 350 gflop = 5.6 teraflop.

Of course that's just on paper. Let's assume 2 teraflop effectively. If a
programmer can achieve that in a program he's a big hero of course :)

2 * 10^12 calculations. I'll assume single precision now by the way.

Everyone here is always discussing double precision, but the reality is that
single precision simply goes so much faster on the fast cheapo processors.
And any FFT you can do in either single precision OR double precision. The
extra overhead for single precision is not that much. About a factor 2. Yet
it allows real cheapo processors that deliver all together *huge* amounts of
gflops.

For the bandwidth calculation, of course, single precision versus double
precision is not really interesting either. It's just a factor 2.

The thing is, 2 * 10^12 calculations a second, assuming efficient reuse of
the caches and RAM within 1 node, at 4 bytes per single precision result,
means that 1 node already has a total *output* of 8 terabyte a second.
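
The arithmetic above, written out as a small Python sketch. It assumes every
calculation produces one 4 byte single precision result that has to leave the
node, which is an upper bound and not a measurement. The 6 GB/s quad-rail
figure is the one quoted earlier in this thread; the constant names are only
for illustration.

```python
CORES_PER_NODE = 16
PEAK_GFLOPS_PER_CORE = 350     # hypothetical future core, as assumed above
EFFECTIVE_FLOPS = 2e12         # assumed achievable rate per node (2 teraflop)
BYTES_PER_RESULT = 4           # single precision
QUAD_RAIL_BYTES_PER_S = 6e9    # 6 GB/s over quad-rail, quoted earlier

peak_teraflops = CORES_PER_NODE * PEAK_GFLOPS_PER_CORE / 1000
output_bytes_per_s = EFFECTIVE_FLOPS * BYTES_PER_RESULT
shortfall = output_bytes_per_s / QUAD_RAIL_BYTES_PER_S

print(f"peak per node:   {peak_teraflops:.1f} teraflop")
print(f"output per node: {output_bytes_per_s / 1e12:.0f} terabyte/s")
print(f"that is {shortfall:.0f}x what a 6 GB/s quad-rail setup carries")
```

Even if only a small fraction of the results ever has to cross the network,
the gap stays at several orders of magnitude.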

That's far beyond what any network delivers currently. It will become a major
problem.

Today's supercomputing simply isn't ready for the kind of bandwidth that a
single cell type processor can use there. There are simple examples.

For example, I wanted to take the md5sum of a 1 terabyte array. Of course only
a single processor would compute the md5sum. The streaming from the i/o wasn't
the problem there. All those arrays can easily deliver really high speeds.
But in practice, even on an Origin 3800, the md5sum of that data went at
far under 10 MB/s because the 500 MHz processors couldn't calculate it faster...

Of course this was a single operation and I only had to do it one time, but
it was really pathetic that it took days of calculation time.
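
To put numbers on that, here is a minimal Python sketch of a chunked md5 plus
the time estimate. It assumes 1 terabyte = 10^12 bytes and uses 10 MB/s as an
upper bound on the hashing rate, since the real rate was well below that. The
helper names are made up for illustration.

```python
import hashlib
import io

def md5_stream(fileobj, chunk_size=1 << 20):
    """MD5 of a file-like object, read in chunks so it never has to fit in RAM."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def hours_to_hash(total_bytes, bytes_per_second):
    """Wall-clock hours when a single CPU hashing at the given rate is the limit."""
    return total_bytes / bytes_per_second / 3600

# Tiny in-memory stand-in for the real array.
small = md5_stream(io.BytesIO(b"abc"))

# 1 terabyte at 10 MB/s: more than a day even in this best case.
best_case_hours = hours_to_hash(1e12, 10e6)
print(small, round(best_case_hours, 1))
```

MD5 is a serial hash, each block depends on the previous one, so extra
processors do not help a single md5sum no matter how fast the disks stream.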

In general the problem with i/o is the bugs in the file system software
more than the speed of the i/o. Usually when doing big operations with
processors, it's possible to first do a lot of calculations before
streaming it to disk.

Yet the network will be a big problem if the number of gflops per cpu
advances as much as it now looks like it will.

Add to that the hard fact that in the past we had networks that were
relatively tiny compared to the average cluster size of hundreds if not
thousands of nodes.

The budget simply is several tens of millions for the big systems. I'd say
each big IT nation should have a system of at least $20 million in future.

Just calculate then what total cpu power governments will be able to afford
in future in that respect. If you then do the calculation of how many
petaflops networks need to address, I am sure that you guys will find
clever solutions to transport all that between nodes!

>with the number of rails.  There are some cases where this doesn't
>appear to hold true, for example given a 16*16 machine average bandwidth
>between two CPUs won't quite double as you double the number of rails
>because 15/256 ranks are local to any given process so will get linear
>bandwidth independent of the number of rails.  This however is simply a
>matter of understanding the topology of the machine.  Another odd case
>is broadcast, assuming the network can deliver into a 16 CPU machine at
>2GB/s shepherding this data inside the node to 16 distinct memory
>locations within that node at 2GB/s *each* isn't possible and the
>network ends up waiting for the node.
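
Ashley's 16*16 example can be put into a small model, sketched here in Python.
The 1 GB/s per rail and 1 GB/s intra-node figures are placeholders that are
not from the thread, and the function name is invented for illustration. The
only thing taken from the text is that 15 peers sit on the same node and do
not benefit from extra rails.

```python
def average_pair_bandwidth(rails, rail_gb_s, local_gb_s, cpus_per_node=16, nodes=16):
    """Average bandwidth from one rank to each of the other ranks in the machine."""
    total_ranks = cpus_per_node * nodes
    local_peers = cpus_per_node - 1              # 15 ranks share the node
    remote_peers = total_ranks - cpus_per_node   # 240 ranks are across the network
    total = local_peers * local_gb_s + remote_peers * rails * rail_gb_s
    return total / (local_peers + remote_peers)

# Placeholder numbers: 1 GB/s per rail, 1 GB/s between ranks on the same node.
one_rail = average_pair_bandwidth(1, rail_gb_s=1.0, local_gb_s=1.0)
two_rail = average_pair_bandwidth(2, rail_gb_s=1.0, local_gb_s=1.0)
speedup = two_rail / one_rail

print(f"average gain from doubling the rails: {speedup:.2f}x")
```

With these placeholder numbers the average comes out a little under 2x, which
is the "won't quite double" effect described above.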
>
>The greatest number of rails I've ever seen in one machine was seven
>however this was an old alpha test cluster and was done as a proof of
>concept rather than an actual product, it only had two nodes.
>
>Ashley,