[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58

Mon Mar 26 07:59:56 PDT 2007

On Mon, 26 Mar 2007, Håkon Bugge wrote:

> Hi Christian,
> 
> At 01:19 24.03.2007, beowulf-request at beowulf.org wrote:
> >I've yet to see a significant number of message-passing applications
> >show that an RDMA offload engine, as opposed to any other messaging
> >engine, is a stronger performance determinant.  That's probably
> >because there are other equally important and desirable features
> >implemented in other messaging engines.
> 
> 
> I find this statement hard to justify from 
> available benchmark data. Looking at the LS-DYNA 
> neon_refined_revised submissions to 
> www.topcrunch.org, you can add one in favour of 
> RDMA ;-). Scali MPI Connect, utilizing SDR IB, 
> performs better than all comparable systems, 
> except for one case, where Infinipath is faster. 
> This is somewhat surprising to me, given the 
> latency and message rate advantage Infinipath has 
> compared to traditional IB. Therefore, let me use 
> this opportunity to stress that it's not only the 
> interconnect architecture, but also the software 
> harnessing it (read MPI) that matters.

Hi Håkon,

I'm not sure I would call a submission significant when it compares
results between configurations that are not compared at scale
(apparently a large versus a small switch, and a much heavier
shared-memory component at small process counts).  For example, in
your submitted configurations, inter-node communication over the
interconnect is never more heavily involved than intra-node shared
memory, and when the interconnect does become dominant at 32 procs,
that's when InfiniPath is faster.  On the flip
side, you're right that these results show the importance of an MPI
implementation (at least for shared memory), which also means your
product is well positioned for the next generation of node
configurations in this regard.  However, because of the node
configurations and because this is really one benchmark, I can't take
these results as indicative of general interconnect performance.  Oh,
and because you're forcing me to compare results on this table, I now
see what Patrick at Myricom was saying -- the largest config you show
that stresses the interconnect (8x2x2) takes 596s walltime on a
similar Mellanox DDR and 452s walltime on InfiniPath SDR (yes, the
pipe is half as wide but the walltime is roughly 25% lower).  We have
performance engineers who gather this type of data and who've seen
these trends on other benchmarks, and they'll be happy to correct any
misconceptions, I'm certain.
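
As a rough check of the arithmetic on the two walltimes quoted above,
here is a minimal sketch.  The variable and function names are
illustrative assumptions; only the 596s and 452s figures come from the
table under discussion.

```python
# Walltimes quoted above for the 8x2x2 configuration, in seconds.
ddr_walltime = 596.0  # Mellanox DDR
sdr_walltime = 452.0  # InfiniPath SDR


def walltime_reduction(slow, fast):
    """Fraction of the slower run's walltime saved by the faster run."""
    return (slow - fast) / slow


def speedup(slow, fast):
    """How many times faster the faster run completes."""
    return slow / fast


reduction = walltime_reduction(ddr_walltime, sdr_walltime)
ratio = speedup(ddr_walltime, sdr_walltime)

print(f"walltime reduction: {reduction:.1%}")
print(f"speedup: {ratio:.2f}x")
```

Depending on which of the two measures is meant, "better" comes out
at about 24% (less walltime) or about 32% (higher throughput), so
either reading supports the point being made.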

Now I feel like I'm sticking my tongue out like a shameless vendor,
and yet my original discussion is not really about beating the
InfiniPath drum, as your reply insinuates.  Rather, I was trying
to point out that what curses MPI in its inability to semantically
match the interfaces designed for offload is also what makes MPI
effective on other grounds.  Namely, a receiver-driven model with
no remote completion guarantees leaves enough room for implementors
to provide efficient network performance in many, perhaps
unconventional, forms.  Distilling the MPI discussion into "cpu
overhead" focuses on a very specialized (i.e. narrow) part of
solving the MPI problem, a problem for which RDMA offload is not
a panacea.
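
To make the receiver-driven point concrete, here is a toy sketch in
plain Python.  It is not taken from any MPI implementation: the class
ToyReceiver, the function toy_send, and the two-queue layout are all
assumptions chosen for illustration.  It only models the idea that
the receiver decides where a message lands by matching on (source,
tag), and that a send returning tells the sender nothing about
whether the receiver has matched it yet.

```python
from collections import deque


class ToyReceiver:
    """Toy model of receiver-side matching (hypothetical, not a real MPI)."""

    def __init__(self):
        self.posted = deque()      # receives posted before their message arrived
        self.unexpected = deque()  # messages that arrived before a matching receive

    def post_recv(self, source, tag, buffer):
        """Post a receive; return True if an already-arrived message matched."""
        for i, (src, t, payload) in enumerate(self.unexpected):
            if (src, t) == (source, tag):
                del self.unexpected[i]
                buffer.extend(payload)
                return True
        self.posted.append((source, tag, buffer))
        return False

    def deliver(self, source, tag, payload):
        """Network hands over a message; return True if a posted receive matched."""
        for i, (src, t, buffer) in enumerate(self.posted):
            if (src, t) == (source, tag):
                del self.posted[i]
                buffer.extend(payload)
                return True
        self.unexpected.append((source, tag, payload))
        return False


def toy_send(receiver, source, tag, payload):
    """Sender side: returning here says nothing about remote completion."""
    receiver.deliver(source, tag, bytes(payload))
    return "locally complete"


rx = ToyReceiver()

# Case 1: the message arrives before the receive is posted.
early = bytearray()
status = toy_send(rx, source=0, tag=7, payload=b"early")
matched_on_post = rx.post_recv(source=0, tag=7, buffer=early)

# Case 2: the receive is posted before the message arrives.
late = bytearray()
posted_first = rx.post_recv(source=1, tag=9, buffer=late)
toy_send(rx, source=1, tag=9, payload=b"late")
```

In both cases the sender sees the same result, while how and when the
data reaches the user buffer is left entirely to the receiving side.
That latitude is what lets an implementation choose among quite
different delivery strategies.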

    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)