[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58

Mon Mar 26 14:34:09 PDT 2007

Hi again Christian,

At 16:59 26.03.2007, Christian Bell wrote:

>Hi Håkon,
>
>I'm unsure if i would call significant a 
>submission comparing results between 
>configurations not compared at scale (in 
>appearance large versus small switch, much 
>heavier shared-memory component at small process 
>counts).  For example, in your submitted 
>configurations, the interconnect communication 
>(inter-node) is never involved more than shared 
>memory (intra-node) and when the interconnect 
>does become dominant at 32 procs, that's when InfiniPath is faster.

I am not sure how you count this. In my "world", every 
process communicates with more remote processes 
than local ones in all cases except the 
single-node runs. For example, in a two-node case with 2 
or 4 processes per node, a process has 1 or 3 
other local processes and 2 or 4 remote 
processes. Excluding the single-node cases, we 
have six runs (2x2, 4x2, 8x2, 2x4, 4x4, 8x4), 
and RDMA is faster than message passing in 5 of them.
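
For anyone who wants to check the counting, here is a minimal Python sketch. It assumes the NxM notation means N nodes with M processes per node, and that every process talks to every other process, which is a simplification and not something measured for this application. The function name peer_counts is made up for this illustration.

```python
def peer_counts(nodes, procs_per_node):
    """Return (local_peers, remote_peers) as seen by one process.

    Assumption: all-to-all communication, so every other process is a peer.
    """
    local_peers = procs_per_node - 1               # others on the same node
    remote_peers = (nodes - 1) * procs_per_node    # everyone on other nodes
    return local_peers, remote_peers


# The six multi-node runs listed above, as (nodes, processes per node).
configs = [(2, 2), (4, 2), (8, 2), (2, 4), (4, 4), (8, 4)]

for nodes, ppn in configs:
    local_peers, remote_peers = peer_counts(nodes, ppn)
    print(f"{nodes}x{ppn}: {local_peers} local, {remote_peers} remote")
```

Under that assumption, remote peers outnumber local peers in all six configurations.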

As for the 32-core case, I am running as fast 
as InfiniPath on this one, but not with a 
released product (yet). Hence I haven't published it.

Based on this, I did not call it a significant 
finding, but merely an indication of RDMA being 
faster than (up to 16 cores) or as fast as message 
passing for _this_ application and dataset.

>On the flip side, you're right that these 
>results show the importance of an MPI 
>implementation (at least for shared memory), 
>which also means your product is well positioned 
>for the next generation of node configurations 
>in this regard. However, because of the node 
>configurations and because this is really one 
>benchmark, I can't take these results as 
>indicative of general interconnect 
>performance.  Oh, and because you're forcing me 
>to compare results on this table, I now see what 
>Patrick at Myricom was saying -- the largest 
>config you show that stresses the interconnect 
>(8x2x2) takes 596s walltime on a similar 
>Mellanox DDR and 452s walltime on InfiniPath SDR 
>(yes, the pipe is "100%" smaller but the performance is 25% better).

Just to avoid any confusion, the 596s number is 
_not_ with Scali MPI Connect (SMC), but with a 
competing MPI implementation. SMC achieves 551s 
using SDR. I must admit your InfiniPath number is 
new to me, as TopCrunch reports 482s for this configuration with InfiniPath.
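
To put the four walltimes side by side, a small Python sketch follows. The numbers are the ones quoted in this thread; the labels and the helper name walltime_reduction_pct are only for illustration, and "reduction" here means the relative drop in walltime, which is one of several ways to express "faster".

```python
def walltime_reduction_pct(slow_s, fast_s):
    """Percentage by which fast_s is shorter than slow_s (both in seconds)."""
    return 100.0 * (slow_s - fast_s) / slow_s


# Walltimes in seconds for the 32-process configuration, as quoted above.
walltimes = {
    "competing MPI, Mellanox DDR": 596,
    "SMC, SDR": 551,
    "InfiniPath SDR (TopCrunch)": 482,
    "InfiniPath SDR (quoted by Christian)": 452,
}

baseline = walltimes["competing MPI, Mellanox DDR"]
for label, seconds in walltimes.items():
    reduction = walltime_reduction_pct(baseline, seconds)
    print(f"{label}: {seconds}s ({reduction:.1f}% less than {baseline}s)")
```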

>We have performance engineers who gather this 
>type of data and who've seen these trends on 
>other benchmarks, and they'll be happy to right 
>any wrong misconceptions, I'm certain.
>
>Now I feel like I'm sticking my tongue out like 
>a shameless vendor and yet my original 
>discussion is not really about beating the 
>InfiniPath drum, which your reply insinuates.

Well, my intent was to draw the wulfers' attention 
to some published facts containing 
apples-to-apples comparisons, in an interesting 
discussion of RDMA vs. message passing. Given the 
significant (yes, I mean it) difference in 
latency and message rates, I was indeed 
surprised. My question remains: if there existed 
an RDMA API with characteristics similar to those of the 
best message passing APIs, how would a good MPI implementation perform?

Håkon