[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58

Christian Bell christian.bell at qlogic.com
Mon Mar 26 18:16:25 PDT 2007


On Mon, 26 Mar 2007, Håkon Bugge wrote:

> And based on this I did not call these significant 
> findings, but merely an indication of RDMA being 
> faster (up to 16 cores) or as fast as message 
> passing for _this_ application and dataset.

My apologies, Håkon; I misunderstood your intention.  I'm defending
the claim that there are no "significant" plusses for RDMA offload as
a general statement: approaches that expose RDMA offload and those
that don't can remain competitive across a very large space of today's
popular applications.  I disagree with your conclusion that RDMA gets
the nod for this particular application and dataset.  More below...

> Just to avoid any confusion, the 596s number is 
> _not_ with Scali MPI Connect (SMC), but a 
> competing MPI implementation. SMC achieves 551s 
> using SDR. I must admit your Infinipath number is 
> new to me, as topcrunch reports 482s for this configuration with
> Infinipath.

I can't type; 482 was indeed a typo.  But still, I wouldn't take the
absolute numbers "as is", since the single-node base cases have
different performance.  Since 1x2x1 is our only common base case, and
since Scali is faster there at 4212s versus 4863s, the IB interconnect
you're testing should be achieving about 416s instead of 550s to show
strong scaling in line with the 8x2x2 InfiniPath time to solution
(at 482s).
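
For what it's worth, the normalization I'm doing is just this (a small
Python sketch; the only inputs are the times already quoted in this
thread, everything else is illustrative):

    # Normalize each setup by its own 1x2x1 base case before comparing
    # the 8x2x2 times to solution (figures quoted in this thread).
    t_base_scali  = 4212.0   # s, 1x2x1 base case, Scali/IB setup
    t_base_ipath  = 4863.0   # s, 1x2x1 base case, InfiniPath setup
    t_ipath_8x2x2 =  482.0   # s, 8x2x2 with InfiniPath

    speedup_ipath = t_base_ipath / t_ipath_8x2x2   # ~10.1x over its own base
    t_expected    = t_base_scali / speedup_ipath   # ~417s (the ~416s above), not ~550s
    print(round(speedup_ipath, 2), round(t_expected))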

In fact, both InfiniPath and your IB follow roughly the same scaling
curve up to 32 processes (this is consistent with GregL's performance
data shown earlier today with HP-MPI's DDR results).  So at least
part of these results seems to be MPI-implementation and even data-rate
agnostic -- something else is going on with the interconnect.

To sum up both the HP-MPI/DDR and Scali/SDR cases, I think it would
be fair (or at least worth mentioning) to note that DDR and RDMA
offload do *not* provide a significant increase in scalable
performance here -- quite the opposite.

> Well, my intent was to draw the wulfers' attention 
> to some published facts containing 
> apples-to-apples comparisons, in an interesting 
> discussion of RDMA vs. message passing. Given the 
> significant (yes, I mean it) difference in 
> latency and message rates, I was indeed 
> surprised. 

Actually, this is the trend we've seen in many applications under
strong scaling.  With a fixed problem size and an increasing number
of processors, message sizes tend to shrink, placing more of the burden
on the interconnect, where message rate and latency are likely to
dominate over large-message bandwidth performance.
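
As a rough illustration (a toy Python model with assumed, illustrative
numbers -- not measured data), latency's share of the per-message cost
grows quickly as messages shrink:

    # Toy cost model: time per message = latency + size/bandwidth.
    # Both constants are assumptions picked purely for illustration.
    latency   = 2e-6      # s, assumed small-message latency
    bandwidth = 1.0e9     # B/s, assumed large-message bandwidth

    for size in (1000000, 64000, 4000, 256):    # bytes per message
        t = latency + size / bandwidth
        print(size, round(latency / t, 3))      # latency's share of the cost

With the message sizes that strong scaling leaves each process, that
share moves from a few percent to the dominant term, which is where
message rate and latency decide the outcome.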

>            My question still is: if there existed 
> an RDMA API with characteristics similar to the 
> best message passing APIs, how would a good MPI implementation
> perform?

With equal metrics/performance and phrased in this manner, it seems
that RDMA still has to implement the semantics that message passing
already provides, which suggests that in this case the RDMA interface
is at a loss.  Maybe I'm missing something in your question...
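
To make the "semantics" point concrete, here's a toy sketch (Python,
purely illustrative -- not any real MPI's code) of the two-sided
matching an MPI library still has to layer on top of a one-sided RDMA
transport: posted receives, source/tag matching, and an
unexpected-message queue.

    # Toy two-sided matching over a one-sided transport; illustrative only.
    ANY_SOURCE, ANY_TAG = object(), object()   # MPI-style wildcards

    posted     = []   # receives posted before their message arrived
    unexpected = []   # messages that arrived before a matching receive

    def on_arrival(src, tag, data):
        # the wire only delivered bytes; matching is the library's job
        for r in posted:
            if r["src"] in (src, ANY_SOURCE) and r["tag"] in (tag, ANY_TAG):
                posted.remove(r)
                r["buf"].extend(data)           # copy into the user's buffer
                return
        unexpected.append((src, tag, data))     # hold until a receive matches

    def recv(src, tag, buf):
        for (s, t, data) in unexpected:
            if src in (s, ANY_SOURCE) and tag in (t, ANY_TAG):
                unexpected.remove((s, t, data))
                buf.extend(data)
                return
        posted.append({"src": src, "tag": tag, "buf": buf})

None of that matching and buffering is free, and it's exactly the part
a message-passing API already gives you.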


    . . christian

-- 
christian.bell at qlogic.com
(QLogic SIG, formerly Pathscale)


