[Beowulf] Re: Beowulf Digest, Vol 37, Issue 58
Håkon Bugge
Hakon.Bugge at scali.com
Thu Apr 12 02:37:26 PDT 2007
Hi Christian,
Sorry for this very delayed answer.
At 03:16 27.03.2007, Christian Bell wrote:
>I can't type, 482 was indeed a typo. But still, I wouldn't look at
>the absolute numbers "as is" since the single-node base case has
>different performance. Since 1x2x1 is our only common base case and
>since Scali is faster at 4212 versus 4863, the IB interconnect you're
>testing should be achieving 416s instead of 550s to produce strong
>scaling similar in line with the 8x2x2 InfiniPath time to solution
>(at 482s).
Well, you do know Amdahl vs. Gustafson, right?
The dataset is fixed, and the elapsed time includes
initialization, writing of animation files, and
more. Hence, slower per-node performance would
_scale_ better.
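A minimal sketch of that argument, assuming the elapsed time splits into a fixed part (initialization, file writes) that does not shrink with node count and a compute part that parallelizes perfectly. The function names and all numbers below are hypothetical and chosen only for illustration; they are not measurements from this thread.

```python
def elapsed(fixed_s, compute_s, nodes):
    """Elapsed time: a fixed, non-scaling part plus a perfectly parallel part."""
    return fixed_s + compute_s / nodes


def speedup(fixed_s, compute_s, nodes):
    """Strong-scaling speedup relative to a single node (Amdahl-style)."""
    return elapsed(fixed_s, compute_s, 1) / elapsed(fixed_s, compute_s, nodes)


# Hypothetical values: the same fixed overhead for both systems,
# but one system has slower per-node compute.
FIXED_S = 100.0
FAST_COMPUTE_S = 4000.0
SLOW_COMPUTE_S = 5000.0
NODES = 16

for label, compute_s in (("fast", FAST_COMPUTE_S), ("slow", SLOW_COMPUTE_S)):
    t = elapsed(FIXED_S, compute_s, NODES)
    s = speedup(FIXED_S, compute_s, NODES)
    print(f"{label} nodes: elapsed {t:.1f} s, speedup {s:.2f}x")
```

Under these assumptions the slower system shows the larger speedup, because the fixed part is a smaller fraction of its total, even though its time to solution is still longer.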
For this application field, crashworthiness
testing, most users keep the number of cores
constant throughout the duration of a project (12
- 18 months). This is due to numerical stability and
the verification thereof. Hence, the interesting
point is not how far and fast you could run, but
the cost of a system capable of running the
application instances at 60-80% parallel efficiency.
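For reference, parallel efficiency relative to a base run can be sketched as below. The timings are the ones quoted earlier in this thread; the core counts are an assumption, reading 1x2x1 as 2 cores and 8x2x2 as 32 cores. The function name and run labels are hypothetical.

```python
def parallel_efficiency(base_time_s, base_cores, time_s, cores):
    """Efficiency of a scaled run relative to a base run on fewer cores."""
    achieved_speedup = base_time_s / time_s
    ideal_speedup = cores / base_cores
    return achieved_speedup / ideal_speedup


BASE_CORES = 2     # assumption: 1x2x1 means 2 cores
SCALED_CORES = 32  # assumption: 8x2x2 means 32 cores

# (base time in s, 8x2x2 time in s), as quoted in the thread
runs = {
    "IB run": (4212.0, 550.0),
    "InfiniPath run": (4863.0, 482.0),
}

for label, (base_s, scaled_s) in runs.items():
    eff = parallel_efficiency(base_s, BASE_CORES, scaled_s, SCALED_CORES)
    print(f"{label}: {eff:.1%} efficiency relative to the {BASE_CORES}-core base")
```

Note that this measures efficiency against each run's own base case, so a slower base case raises the figure.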
As to the RDMA vs. MP based interconnect
semantics, the problem I am facing is that the
RDMA interconnect I am using is more or less
collapsing at 32 cores. Using alltoall with a 1k
packet size, it actually performs worse than GbE.
Sigh! (And please, do not turn this into vendor
harassment, as I am pretty sure this has to do
with implementation and not architecture). So,
what I have shown is that an RDMA interconnect
performs faster than a message passing
interconnect which has roughly 3x lower latency
and 20x (?) higher message rate, up to a scaling
point where the RDMA _implementation_ collapses.
And this _despite_ the fact that the RDMA based MPI
has to perform the MPI message matching.
>With equal metrics/performance and phrased in this manner, it seems
>that RDMA still has to implement the semantics that message-passing
>already provides, which suggests in this case that the RDMA interface
>is at a loss. Maybe I'm missing something to your question...
I doubt you're missing anything;-) But let me
stress that as the number of cores per node
scales, an HCA with message passing semantics and
message matching in the HCA will have a constant
message matching rate. For an RDMA based MPI, which
uses the cores for message matching, the message
matching rate would be almost proportional to the number of cores...
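A crude model of that last point, as a sketch only. The rates are in arbitrary units, and the function names and the `scaling` factor (standing in for "almost" proportional) are assumptions, not measured values.

```python
def hca_matching_rate(hca_rate, cores_per_node):
    """Matching done in the HCA: the rate does not depend on the core count."""
    return hca_rate


def host_matching_rate(per_core_rate, cores_per_node, scaling=0.9):
    """Matching done by the cores: the rate grows with the core count.

    `scaling` is a hypothetical factor below 1.0 for imperfect scaling.
    """
    return per_core_rate * cores_per_node * scaling


HCA_RATE = 4.0       # hypothetical, arbitrary units
PER_CORE_RATE = 1.0  # hypothetical, arbitrary units

for cores in (1, 2, 4, 8, 16):
    hca = hca_matching_rate(HCA_RATE, cores)
    host = host_matching_rate(PER_CORE_RATE, cores)
    print(f"{cores:2d} cores/node: HCA {hca:.1f}, host cores {host:.1f}")
```

With these made-up numbers the host-side rate overtakes the fixed HCA rate somewhere between 4 and 8 cores per node; where the crossover actually falls depends entirely on the real rates.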
Håkon
--
Håkon Bugge
CTO
dir. +47 22 62 89 72
mob. +47 92 48 45 14
fax. +47 22 62 89 51
Hakon.Bugge at scali.com
Skype: hakon_bugge
Scali - http://www.scali.com
Scaling the Linux Datacenter