Beowulfs can compete with Supercomputers [was Beowulf: A theorical approach]

Sat Jun 24 16:16:44 PDT 2000

Greg Lindahl wrote:
> 
> > > > Have you guys had any experience with comparing the performance of
> > > > similar codes on an SP versus a fully distributed cluster with similar
> > > > performance and number of processors?
> > >
> > > No -- there no Alpha slow enough for such a comparison ;-)
> > >
> >
> > What models have you benchmarked?
> > Power3II is quite fast, even compared to a 21264.
> 
> No, it's considerably slower. Look at the SPEC95fp results. 

Ah yes, SPEC. More on that in a minute.

> As for the
> graph, it's
> 
> http://www.mmm.ucar.edu/mm5/mpp/helpdesk/20000106.html
> 
> The graph is a bit misleading; the Compaq system that apparently is faster
> than mine at the same clock has DDR Sram, which I now offer. The extremely
> slow IBM SP result is a 375 mhz Power3.
> 

To clarify, the P3II is the SP-WH2 entry, which is second only to
the Alphas at larger processor counts with this code.

Interesting, but the SC667 is no Beowulf. Neither are
almost any of the machines on the chart, except that little
PIII line at the bottom. That pretty much takes us out
of the realm of this group, but I'll continue for a short bit,
since it is a realm of computing I'm interested in.

First, a plot of linear speedup would put into perspective
exactly how this code scales. That tail-off tells me why
you're interested in high-speed interconnects, e.g., Myrinet.
You would be better off running as two 128-node systems
than as one 256-node system, assuming the problem can be solved
on 128 nodes. Some problems probably can't, so you suffer. Four
64-node systems would be even better, as would eight 32-node
systems, from a throughput point of view.
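
To make the throughput argument concrete, here is a minimal Python
sketch. It assumes an Amdahl-style speedup curve with a fixed
non-scaling fraction f. That model, the value of f, and the function
names are illustrative assumptions, not measurements from the chart.

```python
def speedup(nodes, f):
    """Amdahl-style speedup on `nodes` nodes; f is the assumed non-scaling fraction."""
    return 1.0 / (f + (1.0 - f) / nodes)


def throughput(total_nodes, partitions, f):
    """Relative job throughput when total_nodes is split into equal partitions,
    each running its own independent job (1.0 = one job on one node)."""
    nodes_per_job = total_nodes // partitions
    return partitions * speedup(nodes_per_job, f)


if __name__ == "__main__":
    F = 0.01  # hypothetical: 1% of the work does not scale
    for parts in (1, 2, 4, 8):
        size = 256 // parts
        rate = throughput(256, parts, F)
        print(f"{parts} x {size:3d} nodes: relative throughput {rate:6.1f}")
```

In this model, any f greater than zero means more, smaller partitions
always deliver more total throughput. The catch is the one already
noted: a job that does not fit on the smaller partition.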

This leads us into comparing system interconnects. As an
example of SPEC in action versus this chart, the SGI O2 400
manages to outperform the SP WH2 and matches the ACL/667,
even though SPEC says it shouldn't. That is most likely
due to this code not being able to keep the CPUs busy doing
useful work. In other words, the interconnect is too slow
for the processor. So, you have options. First, live with
it. Second, alter the algorithm to require less communication
(no easy task). Or third, look for faster interconnects.
If you want improvement, the shortest route to speedups would
probably be to buy a faster interconnect.

Mostly this chart says to me that this problem has a substantial
communication portion to it. I'd be interested in a plot of
average CPU utilization during this code run to see exactly
how well it used the CPUs on each architecture. That is, unless
it uses a spin loop waiting on the interconnect to improve latency.
In that case, it's difficult to see what's really going on.
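
As a rough illustration of what such a utilization plot would measure,
here is a small Python sketch. It assumes computation and communication
do not overlap, and the per-timestep costs are made-up numbers, not
values from any of these machines. With a spin-wait, the operating
system would report the CPU as busy the whole time, so the two times
would have to come from timers inside the application.

```python
def cpu_utilization(compute_s, comm_s):
    """Fraction of wall time a CPU spends on useful work, assuming
    computation and communication do not overlap."""
    return compute_s / (compute_s + comm_s)


if __name__ == "__main__":
    # Hypothetical per-timestep costs in seconds, not measured values.
    base = cpu_utilization(compute_s=1.0, comm_s=0.5)
    faster_cpu = cpu_utilization(compute_s=0.5, comm_s=0.5)   # CPU twice as fast
    faster_net = cpu_utilization(compute_s=1.0, comm_s=0.25)  # interconnect twice as fast
    print(f"baseline:            {base:.2f}")
    print(f"faster CPU:          {faster_cpu:.2f}")
    print(f"faster interconnect: {faster_net:.2f}")
```

Under these assumptions a faster CPU on the same interconnect spends a
larger share of its time waiting, while a faster interconnect raises
the share spent on useful work.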

One last comment. In the notes section, the SC667 is actually
using only one or two CPUs per 4-CPU node. That would indicate
the SC667 nodes run out of bandwidth somewhere, since they chose not
to post the 4-processors-per-node run. The WH2 runs use all 4 CPUs
on each node. So, basically, you pay for twice as much machine to
get that level of performance. That conclusion is also supported
by the STREAM results for an ES40, which is the node in an SC:
it runs out of memory bandwidth. Changing the number of processors
used on a node can impact both processor performance (memory) and
interconnect bandwidth per processor. To put it bluntly, it looks to
me like Compaq was less than honest in their runs. No one will buy
a system and use it that way, at least not a real company that would
use the system to make money.
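
The cost arithmetic is simple enough to sketch in Python. The job size
below is a hypothetical example and the function names are mine; only
the 4-CPU node size and the one-or-two-CPUs-per-node usage come from
the chart notes.

```python
import math


def nodes_needed(processes, cpus_used_per_node):
    """Nodes required to place `processes` ranks at the given density."""
    return math.ceil(processes / cpus_used_per_node)


def idle_cpus(processes, cpus_used_per_node, cpus_per_node=4):
    """CPUs paid for but left idle across the whole run."""
    return nodes_needed(processes, cpus_used_per_node) * cpus_per_node - processes


if __name__ == "__main__":
    PROCS = 128  # hypothetical job size
    for used in (4, 2, 1):
        print(f"{used} CPU(s)/node: {nodes_needed(PROCS, used):3d} nodes, "
              f"{idle_cpus(PROCS, used):3d} idle CPUs")
```

Running two processes per 4-CPU node doubles the node count for the
same job, and running one per node quadruples it.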

Wes