[Beowulf] Three notes from ISC 2006

Wed Jun 28 12:45:30 PDT 2006

On Wed, 28 Jun 2006, Patrick Geoffray wrote:

> High message rate is good, but the question is how much is enough ? At 3 
> million packet per second, that's 0.3 us per message which all of it is 
> used by the communication library. Can you name real world applications 
> that need to send messages every 0.3 us in a sustained way ? I can't, 
> only benchmarks do that. At 1 million packet per second, that one 
> message per microsecond. When does the host actually compute something ? 
> Did you measure the effective messaging rates of some applications ? 
> From you flawed white papers, you compared your own results against 
> numbers picked from the web, using older interconnect with unknown 
> software versions. Comparing with Myrinet D cards for example, you have 
> 4 times the link bandwidth and half the latency (actually, more like 1/5 
> of the latency because I suspect most/all Myrinet results were using GM 
> and not MX), but you say that it's the messaging rate that drives the 
> performance ??? I would suspect the latency in most cases, but you 
> certainly can't say unless you look at it.

Hi Patrick --

I agree with you that the inverse of message rate, or the small
message gap in logP-derived models, is a more useful way to view the
metric.  How much more important it is than latency depends on the
relative difference between your gap and your latency.  One
can easily construct a collective model using logP parameters to
estimate expected performance.  When the latency is low enough, a gap
of 0.3 us versus 1.0 us makes a difference, even in collectives that
can be completed in a logarithmic number of steps.  The fact that a
process need not send 1 million messages per second is beside the
point.  In more than just the all-to-all case you cite, the
per-message cost can determine the amount of time you spend in many
MPI communication operations.
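
To make the collective argument concrete, here is a minimal sketch of
such a model.  It assumes the textbook logP parameters (L = latency,
o = per-message host overhead, g = gap, P = processes) and a greedy
single-item broadcast in which every informed process keeps sending.
The function names and the sample values of L and o are illustrative
assumptions, not measurements of any interconnect.

```python
import heapq


def gap_from_rate(messages_per_second):
    """Small-message gap in microseconds for a given message rate."""
    return 1e6 / messages_per_second


def logp_broadcast_time(P, L, o, g):
    """Estimated time (us) for a greedy single-item broadcast under logP.

    Every informed process may start a new send every max(g, o) us.
    A send started at time t is usable by the receiver at t + o + L + o.
    """
    if P <= 1:
        return 0.0
    interval = max(g, o)      # minimum spacing between sends of one process
    ready = [0.0]             # times at which some process can start a send
    informed = 1
    finish = 0.0
    while informed < P:
        t = heapq.heappop(ready)
        arrival = t + o + L + o   # send overhead + latency + recv overhead
        informed += 1
        finish = max(finish, arrival)
        heapq.heappush(ready, t + interval)  # sender may send again
        heapq.heappush(ready, arrival)       # receiver may start sending
    return finish


if __name__ == "__main__":
    # Hypothetical parameters: o = 0.1 us, 8 processes, two latencies.
    for L in (1.0, 50.0):
        fast = logp_broadcast_time(8, L, 0.1, gap_from_rate(3e6))
        slow = logp_broadcast_time(8, L, 0.1, gap_from_rate(1e6))
        print(f"L={L} us: gap 0.33 us -> {fast:.2f} us, "
              f"gap 1.0 us -> {slow:.2f} us")
```

Under these assumed numbers the smaller gap shortens the broadcast
noticeably when L is about 1 us, and much less so when L is 50 us,
which is the sense in which the gap matters once latency is low
enough.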

> * high ratio of processes per NIC: that's actually the only good 
> argument. If you cannot increase the total message rate when you add 
> processes, then your message rate per process decrease. Again, the your 
> infamous marketing material is wrong: the bulk of the send overhead is 
> not in the NIC. Actually, the latency is wrong too (sources says a talk 
> from 2003, newsgroups and Pathscale estimations ?!?). For various 
> reasons (but not for any message rate consideration), we have moved the 
> reliability support from the NIC to the host, so the NIC overhead is now 
> much lower than the host overhead, which is dominated by the memory copy 
> for small messages. Would you have access to the source of the MX-1.2 
> library, you would have seen it (You will when it is available under the 
> widely known ftp.myri.com password, we always ship the source of our 
> libraries, not like other vendors :-o ).
> So, the message rate does increase with the number of processes with 
> MX-1.2, but it will still be bounded by the NIC overhead, which is 
> likely more important with Myrinet (2G or 10G) that with your NIC. 
> However, this is the same question: how much is enough per process ? 
> Sharing one NIC for 8 cores is, IMHO, a bad idea. A ratio of one NIC for 
> 2 core was quite standard the last decade. It may make economical sense 
> to increase this ratio to one NIC for 4 cores, but I would not recommend 
> to go higher than that. And with the number of cores going up (if people 
> actually buy many-cores configurations, the sweet spot is definitively 
> not at 8-way), it will make a lot of sense to use hybrid 
> shared-memory/interconnect for collective communications. In this 
> context, the message rate requirement of an all-to-all is not shared 
> among processes.

I'm not ready to put my stake in the ground in predicting how many
cores will drive each NIC in the near future.  The past decade didn't
have multi-core and upcoming price/performance points may warrant
employing more cores on each node.  Sure, a single core will always
be able to exploit full bandwidth on a single NIC.  However, strong
scaling and some of the more advanced codes that don't always operate
in areas of peak bandwidth can leave enough headroom in available
bandwidth for other cores to use.  Even if 4 or 8 cores
oversubscribe a single NIC, why not use the cores if it so happens
that the communication patterns and message sizes still allow you to
improve your time-to-solution?  After all, time-to-solution is what
it's all about.  Sure, a second NIC will always help, but getting the
best performance out of a single NIC and maintaining scalable message
rates as the number of per-node cores increases is a useful metric.

> Finally, you don't talk much about the side effects of your 
> architectural decisions, such as no little/no overlap and high CPU overhead.

We can have a discussion on the correlation of programming models and
their impact on architecture.  Unfortunately, there's not much to say
here in terms of side effects relative to everyone's favorite
programming model -- MPI-1.  While it's undeniable that judicious use
of non-blocking operations on networks with offload engines can lead
to better effective performance, how this capability correlates to
applications that people are writing is the real question.  What's
unclear with this type of overlap is the performance/portability you
get in using more advanced MPI communication techniques. The actual
amount of potential communication and computation overlap varies from
vendor to vendor.  One way to fix this at the application level is to
make the computation adaptive to the amount of communication that can
be overlapped.  This is feasible, but it is tricky, and such code is
often written only by MPI communication experts.  Plus, there's the
fact that it's in every vendor's interest to optimize the more basic
(but less exciting) MPI communication primitives to support all the
existing codes.
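
As a rough illustration of why the achievable overlap matters to an
adaptive code, the sketch below uses a deliberately simple timing
model.  It assumes that a fraction of each communication phase can
proceed in the background while the host computes, and that the rest
is exposed.  The function names and the "overlap fraction" parameter
are hypothetical and do not describe any vendor's MPI implementation.

```python
def step_time(compute, comm, overlap_fraction):
    """Time for one step when part of the communication can be hidden.

    compute          -- computation time available to hide communication
    comm             -- total communication time for the step
    overlap_fraction -- share of comm (0..1) that can run in the background
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    hidden = overlap_fraction * comm   # proceeds while the host computes
    exposed = comm - hidden            # the host has to wait for this part
    return exposed + max(compute, hidden)


def compute_needed_to_hide(comm, overlap_fraction):
    """Computation time an adaptive code would schedule to cover comm."""
    return overlap_fraction * comm


if __name__ == "__main__":
    # Same application step on networks with different overlap capability.
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, step_time(10.0, 4.0, fraction))
```

In this model the same code sees a different step time on each
network, so a fixed split between computation and communication that
is tuned for one vendor is not necessarily the right split for
another.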

Life with MPI-2 wouldn't solve the problem either.  Most vendors
choose to expose their offload engines through a generally usable
RDMA interface and have to face the fact that the MPI-2
passive/active model imposes a semantic mismatch and added
synchronization.  The point of one-sided is to remove much of the
implied synchronization you get with MPI-1 and allow applications
that have low synchronization requirements to benefit from pure data
transfers.  An architecture that allows overlap through RDMA
mechanisms can suit these applications very well, but the remaining
problem seems to be lining up an RMA standard that users can
understand and architectures can implement with low added costs.  

Even with MPI-1, much of the RDMA semantics has to be
retrofitted to implement MPI's matched, ordered envelope model -- you
already know this, and much research (and still much more!) has gone
into optimizing this retrofit.  What MPI needs is an RDMA mode so
people can fully exploit their hardware for the characteristics it
has.  In the meantime, people should look at other programming models
that have a tighter fit with RDMA, like global address space
languages.  If that doesn't do, stick to the performance-portable
MPI-1 communication operations.

> The white papers are right on one thing: latency and bandwidth are not 
> enough to fully describe an interconnect. But message rate is just one 
> of the metrics, and I assert that it's not a particularly important one. 
> I suspect that Pathscale picked message rate as a marketing drum because 
> no other interconnects really cared about it. That's was the 
> differentiation bullet from the business workshop I attended.

If you believe that logP-derived models can be useful to predict some
areas of interconnect performance, message rate (or small message
gap) is simply the missing parameter to the model once one has
latency and bandwidth metrics.  Of course, I can't be confident that
it adequately measures performance for all cluster sizes, message
sizes and communication patterns, but it is not just a futile
marketing metric.

cheers,

-- 
Christian Bell
christian.bell at qlogic.com