[Beowulf] Three notes from ISC 2006

Kevin Ball kball at pathscale.com
Wed Jun 28 11:41:59 PDT 2006


Patrick,

  Thank you for the rapid and thoughtful response.

On Wed, 2006-06-28 at 11:23, Patrick Geoffray wrote:
> Hi Kevin,
> 
> Kevin Ball wrote:
> > Patrick,
> > 
> >>  
> >>  From your flawed white papers, you compared your own results against
> >> numbers picked from the web, using older interconnects with unknown
> >> software versions.
> > 
> >   I have spent many hours searching to try to find application results
> > with newer Myrinet and Mellanox interconnects.  I would be thrilled (and
> > I suspect others might as well, but I'm only speaking for myself) if you
> > would take these white papers as a challenge and publish application
> > results with the latest and greatest hardware and software.
> 
> Believe it or not, I really want to do that. I don't think it's
> appropriate to compare results from other vendors though: in Europe,
> it's forbidden to do comparative advertising (i.e., soap X washes
> whiter than brand Y), and I completely agree with the rationale.

This is interesting; I did not know that.  I do see the rationale,
though it complicates things as well: knowing what performance is good
and what is not is very difficult without something to compare against.

> 
> However, there is nothing wrong with publishing application numbers
> versus plain Ethernet, for example, and letting people put curves side
> by side if they want. Or submitting to application-specific websites
> like Fluent. I will do that as soon as I get a decent-sized cluster (I
> have a lot of small ones with various CPUs/chipsets, but my 64-node
> cluster is getting old). Time is also a big problem right now, but I
> will have more manpower in a couple of months. Time is really the
> expensive resource.

I agree on the time point, and appreciate the difficulty of finding both
the time and a cluster large enough to be interesting.

> 
> Most integrators have their own testbeds and they do comparisons, but
> you will never get those results, and even if you could, you could not
> publish them.
> 
> Recently, I have been thinking about something that you may like. With
> motherboards with 4 good PCIe slots coming on the market (driven by SLI
> and such), it could be doable to have a reasonably sized machine, say
> 64 nodes, with 4 different interconnects in it. If Intel or AMD (or
> anyone of good will) would donate the nodes, the interconnect vendors
> would donate NICs + switches + cables, and an academic or governmental
> entity would volunteer to host it, you could have a testbed accessible
> to people for benchmarking. The deal would be: you can use the testbed,
> but your benchmark code has to be available to everyone, it will be run
> on all interconnects, and the results will be public.

I like this idea a lot.  In fact, I've been pushing for us to get such a
cluster internally, but again the time and money questions come into
play, and an internal cluster has similar problems to the integrator
testbeds you mention.

  I have two large concerns.

  One is that whatever software stack is chosen to work with the latest
interconnect products may or may not correlate well with what end users
actually run.  For some protocols (particularly MPI) this doesn't seem
to make a huge difference, though we have seen some effect.  For others
(particularly TCP/IP), there is a humongous difference between different
Linux kernels and distros.  Depending on what software was decided upon,
this might bias the testbed either for or against a solution that does
TCP offload, as compared to one that uses the standard Linux stack.
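
  To make that first concern concrete, here is a minimal, hypothetical
sketch of the kind of interconnect-agnostic benchmark code the testbed
rules would require: a plain MPI ping-pong latency test in C. Nothing in
it is tied to a particular vendor's stack, so it would run unchanged over
whichever MPI implementation each interconnect provides; the equivalent
measurement written against TCP sockets would instead be sensitive to
the kernel version, distro, and whether offload is in play.

/* pingpong.c -- hypothetical sketch, not taken from any white paper.
 * Measures small-message half round-trip latency between two MPI ranks.
 * Build with mpicc; run with two ranks placed on different nodes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    char buf[8] = {0};               /* 8-byte ping-pong payload */
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);     /* start both ranks together */
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg half round-trip latency: %.2f us\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}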

  The second concern is keeping up with N different release cycles:
keeping each product at its latest stable software version, firmware
version (for products with firmware), and hardware revision... and how
this would interact with the above question of having a single supported
software stack.

  So in short... yes, I like the idea a lot, and I think it could
potentially get us to a better place than we are now in terms of
vendors and customers knowing how things compare.  However, there are
potential difficulties, and without doing more research I don't know
how much of a limitation they would place on the final result.

> What do you think of that ?

I'd support such an effort... I do wonder what would happen in terms of
marketing and/or vendor support if a situation like the last 3 years of
AMD/Intel were to arise for interconnects.  If some vendors became
clearly technically inferior, would they withdraw support from the
project?

-Kevin

> 
> Patrick



