message rates and MPI [Was Re: [Beowulf] Three notes from ISC 2006]
christian.bell at qlogic.com
Thu Jun 29 13:50:59 PDT 2006
And on we go...
On Wed, 28 Jun 2006, Patrick Geoffray wrote:
> I totally agree that the gap (g) gets important when the Latency (L) is
> small, but only when you send multiple messages in a row. When sending
> one message at a time, it's irrelevant (if the time between messages or
> even the send overhead (o.s) is larger than the gap, and the we are
> talking fraction of microsecond here). When sending multiple message,
> the gap is the bottleneck only if it's bigger than the send overhead.
> However, the send overhead is per process, where the gap is per NIC. So,
> for the gap to be the bottleneck, you need to send many message in a
> raw, and most likely from multiple processes at the same time. That's
> why the good argument for a small gap is with many cores sending at the
> same time.
Right, but I think that multiple cores sending at the same time is
not necessarily a rare occurrence, neither is sending multiple
consecutive messages since that occurs in collectives (more on this
below). The fact that the gap is a per-NIC parameter is where logP
isn't perfect on modern interconnects. There have been newer models
proposed for small message InfiniBand to take into account some
concurrency provided by the NIC (see LoP by Hoefler). The intuition
there is that a simple gap parameter misses the fact that two
consecutive messages can be cheaper than two messages spaced in time.
I find that similar observations can be made on architectures that
accommodate multiple cores and keep a low gap. Boasting a high
message rate really points at this metric.
> With everything (reliability, etc) done in the NIC, I would not be
> surprised if the NIC-level gap is indeed larger than the send overhead,
> even with small number of processes. In the GM era, that would have been
> huge. With the early MX releases, the firmware code was much smaller,
> but still doing reliability. We have started to move a lot from the NIC
> to the host, mainly to remove state from the NIC so that we can reboot
> the NIC live to do NIC failure recovery. The side effect is that the
> NIC-level gap is much smaller. With MX-1.2, the send gap is about 0.5
> us, and using PCI-Express specific optimization could reduce it to 0.25 us.
I looked up the GM numbers I had measured a few years ago -- o_s was
0.7us and gap was 7us. There's definitely a good case for sending
many small messages if possible, until you reach saturation in the
Myrinet send fifo.
> When you send multiple message, you don't send all of the them to the
> same peers (1->N pattern), so only the gap on the send side should be
> considered (the send gap is almost always smaller than the receive gap
> for small messages). However, using a streaming test measure the message
> rate that would be bounded by the gap on the receive side, not the gap
> on the send side. That's one problem with the streaming test.
I'm not sure I'd really look closely at what you call receiver-side
gap. The network will apply backpressure so that the receiving
endpoint will only see packets arriving at line rate. All
NICs should be able to receive packets at line rate, or else
something is seriously broken. So that said, the "receiver gap" is
simply the inter-packet gap. You are probably a step further in
looking at how the NIC will ensure reliability over N peers and
eventually get the message delivered at the destination (DMA, pio,
etc). How the received message is delivered from the NIC to the
end-user buffer is an artifact of the programming model and what
probably best be described using another metric (i.e. o_r).
> In the case of a N->1 pattern, the receive gap is definitively the
> bottleneck, specially when there is multiple process on the receive
> side. However, that assume that these messages arrives at the same time,
> and this is where I have never seen such scenario in real life because
> the asynchronism between senders is way bigger than the receive gap,
> unless the senders are continuously blasting the receiver and only
> synthetic benchmarks do that.
This pattern is interesting because it really gets at the heart of
the architecture and the cost of ensuring reliability over N endpoints.
If you have a connection-oriented model and maintain reliability in
the NIC (I know you don't, but others do), the NIC should minimize
the amount of state required to maintain reliability, which can be
important at large node counts. Mellanox has had issues with this,
as demonstrated by the introduction of "memfree" cards.
> In short, message rate is important only when several processes (running
> on several cores) are sending or receiving messages at the same time, ie
> doing tightly synchronous collective operations. Do you see a problem in
> my logic ?
I follow you, but I think synchronous collectives is only part of
what benefits from good message rates. Any form of nearest-neighbour
communication and really any multi-core concurrent point-to-point
communication sees a benefit. And if you increase the amount of
cores per node, you're just increasing the potential for concurrent
> From personal experience, collective operations are not that
> synchronous, there is always delay between the processes joining the
> collectives, at least delays larger than the receive gap. That's why
> LogP-derived models are not terribly successful to predict collective
> operations. They always add a factor that they call contention or
> synchronization noise.
Whether messages from multiple cores really line up in time to
benefit from higher achievable messaging rates becomes application
and environment specific. Yes there is a lot of "dead time" in
collectives, which is why there exists proposals for non-blocking
collectives. Alternatively, you can lower your latency and minimize
the amount of dead time. The flip side though is that minimizing the
dead time in the collectives will end up increasing the rate of
messaging in the collective operation itself. Now we're back to
maintaining good message rates.
> metric. My point is that it's a lot of ifs and other metrics such as
> latency apply to a much larger spectrum of communication pattern. I
> don't understand why you focus on message rate when your latency is
> really your strong value.
As I said above with the case of collectives, the minute you improve
your latency, the messaging rate becomes important. They are not
> Yes it's tricky to do overlap right. But if you can extract it from your
> code, it has by far the biggest potential in scalability improvement. By
> overlap, I don't really mean one-sided, I think that's too hard to
> leverage (or I don't have the right brain to use it correctly). However,
> split communications with MPI_Isend/MPI_Irecv/MPI_Wait can take you far
> if you really can efficiently use the processor between the initiation
> and the blocking completion. That's still basic primitives. If overlap
> is not a requirement, then using host CPU for everything is just fine.
> However, you cannot have both, it's a design choice.
This is where I can point the finger at microbenchmarks that measure
the amount of potential overlap and cpu overhead. The ability to
extract overlap is very workload specific, and I totally agree that
if you have an application with less data dependencies and the
opportunity and skill and time to expose it to overlap, you'll get a
benefit. I did this as part of my research and got 2x speedups on
some interconnects and the more we scaled, the better the performance
got. However, the data sets provided an almost ideal case of
exploiting a lack of local data dependencies (FFTs) and used a
different programming model (UPC) that essentially communicates with
RDMA without any implied or hidden synchronization overheads.
But in MPI, even if you craft your way through advanced asynchronous
operations, what really matters at the end of the day is how time
much you spend in the MPI library. It's also my belief that the
"craftier" you get with these operations, the more you tie your MPI
code to a particular interconnect technology.
In summary, I won't argue that latency and bandwidth give you only a
high-level characterization of an interconnect. I do think that
messaging rate, or really a form of approximated 'gap' for the LogP
model, helps paint a better picture of a given architecture. Gap is
still point-to-point and yes, LogP fails to adequately capture
collectives, something many people have shown. However, it's pretty
clear that the gap (and hence messaging rate) is strongly correlated
to the latency on both point-to-point and collective operations.
> I like this thread, don't you ? I wasted tons of precious time, but
> that's is what I want to see on this list, that's not marketing fluff,
> even if half of the recipient may have some pain to follow it :-)
Sure, it's a healthy thread. It would be nice if somebody from the
labs could contribute. They spend a lot of time writing papers and
microbenchmarks, so they usually are opinionated on these topics ;).
. . christian
christian.bell at qlogic.com
More information about the Beowulf