[Beowulf] Three notes from ISC 2006

Wed Jun 28 15:20:03 PDT 2006

Salut Christian,

Christian Bell wrote:
> I agree with you that the inverse of message rate, or the small
> message gap in logP-derived models is a more useful way to view the
> metric.  How much more important it is than latency depends on what
> the relative difference is between your gap and your latency.  One
> can easily construct a collective model using logP parameters to
> measure expected performance.  When the latency is low enough, a gap
> of 0.3 versus 1.0 makes a difference, even in collectives that can be
> completed in a logarithmic number of steps.  The fact that a process
> need not send 1 million messages per second is beside the point.  In
> more than just the all-to-all case you cite, the per-message cost can
> determine the amount of time you spend in many MPI communication
> operations.

I totally agree that the gap (g) becomes important when the latency (L)
is small, but only when you send multiple messages in a row. When
sending one message at a time, it's irrelevant (as long as the time
between messages, or even the send overhead (o.s), is larger than the
gap, and we are talking fractions of a microsecond here). When sending
multiple messages, the gap is the bottleneck only if it's bigger than
the send overhead. However, the send overhead is per process, whereas
the gap is per NIC. So, for the gap to be the bottleneck, you need to
send many messages in a row, and most likely from multiple processes at
the same time. That's why the good argument for a small gap is many
cores sending at the same time.
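
To make that concrete, here is a minimal sketch in Python (not part of
the original exchange). It assumes a simplified model in which each
process can issue one message per send overhead o_s, the processes run
in parallel, and the shared NIC can inject one message per gap g. The
function names and the numbers are hypothetical.

```python
def message_interval(o_s, g, nprocs):
    """Average time between messages leaving the NIC (same unit as inputs).

    Assumed model: each process issues one message every o_s, the
    processes run in parallel, and the shared NIC needs g per message.
    """
    host_interval = o_s / nprocs  # aggregate issue interval of all processes
    return max(host_interval, g)


def gap_is_bottleneck(o_s, g, nprocs):
    """True when the NIC gap, not the host send overhead, limits the rate."""
    return g > o_s / nprocs


if __name__ == "__main__":
    o_s, g = 1.0, 0.5  # microseconds, hypothetical values
    for nprocs in (1, 2, 4, 8):
        print(nprocs, message_interval(o_s, g, nprocs),
              gap_is_bottleneck(o_s, g, nprocs))
```

With these assumed numbers, a single process is limited by its own send
overhead, and the gap only becomes the limit once several processes
share the NIC.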

With everything (reliability, etc.) done in the NIC, I would not be
surprised if the NIC-level gap is indeed larger than the send overhead,
even with a small number of processes. In the GM era, the gap would have
been huge. With the early MX releases, the firmware code was much
smaller, but still doing reliability. We have started to move a lot of
work from the NIC to the host, mainly to remove state from the NIC so
that we can reboot the NIC live to do NIC failure recovery. The side
effect is that the NIC-level gap is much smaller. With MX-1.2, the send
gap is about 0.5 us, and using PCI-Express-specific optimizations could
reduce it to 0.25 us.
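
For reference, a gap translates into an upper bound on message rate by
simple inversion. The small Python sketch below (the function name is
made up for illustration) converts the two gap values quoted above,
which come out at 2 million and 4 million messages per second
respectively. It says nothing about what a real application would
achieve.

```python
def max_message_rate(gap_us):
    """Upper bound on messages per second for a per-message gap in microseconds."""
    return 1e6 / gap_us


if __name__ == "__main__":
    for gap_us in (0.5, 0.25):
        rate = max_message_rate(gap_us)
        print(f"gap {gap_us} us -> at most {rate:,.0f} messages/s")
```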

When you send multiple messages, you don't send all of them to the same
peer (1->N pattern), so only the gap on the send side should be
considered (the send gap is almost always smaller than the receive gap
for small messages). However, a streaming test measures a message rate
that is bounded by the gap on the receive side, not the gap on the send
side. That's one problem with the streaming test.
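
A rough way to see the difference, as a Python sketch under assumed
numbers (the function names and gap values are hypothetical, and the
model ignores latency and overheads entirely): in a streaming test one
receiver absorbs every message, while in a 1->N pattern each receiver
sees only a fraction of them.

```python
def streaming_rate(g_send, g_recv):
    """Messages per microsecond for one sender streaming to one receiver."""
    return 1.0 / max(g_send, g_recv)


def fanout_rate(g_send, g_recv, npeers):
    """Messages per microsecond for one sender spreading evenly over npeers."""
    # Each receiver handles only 1/npeers of the messages, so its gap is
    # amortized over npeers sends.
    return 1.0 / max(g_send, g_recv / npeers)


if __name__ == "__main__":
    g_send, g_recv = 0.5, 1.0  # microseconds, hypothetical values
    print("streaming:", streaming_rate(g_send, g_recv))
    print("1->4     :", fanout_rate(g_send, g_recv, 4))
```

In this model the streaming number reflects the receive gap, while the
1->N number reflects the send gap as soon as there are enough peers.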

In the case of an N->1 pattern, the receive gap is definitely the
bottleneck, especially when there are multiple processes on the receive
side. However, that assumes that these messages arrive at the same
time, and I have never seen such a scenario in real life: the
asynchronism between senders is way bigger than the receive gap, unless
the senders are continuously blasting the receiver, and only synthetic
benchmarks do that.
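
The effect of sender asynchronism can be sketched with a tiny queueing
model in Python. This is an illustration under assumptions, not a
measurement: the receiver processes at most one message per receive
gap, the arrival times are invented, and the helper names are
hypothetical.

```python
def receiver_finish_time(arrivals, g_recv):
    """Time at which the receiver has processed all messages.

    arrivals: message arrival times at the receiving NIC.
    g_recv:   minimum spacing between two processed messages (receive gap).
    """
    done = float("-inf")
    for t in sorted(arrivals):
        # A message is processed when it arrives, or one gap after the
        # previous one, whichever is later.
        done = max(t, done + g_recv)
    return done


def gap_penalty(arrivals, g_recv):
    """Extra delay caused by the receive gap beyond the last arrival."""
    return receiver_finish_time(arrivals, g_recv) - max(arrivals)


if __name__ == "__main__":
    g_recv = 1.0  # microseconds, hypothetical
    print("synchronized:", gap_penalty([0.0, 0.0, 0.0, 0.0], g_recv))
    print("skewed      :", gap_penalty([0.0, 5.0, 10.0, 15.0], g_recv))
```

When the arrivals are spaced further apart than the gap, the penalty in
this model drops to zero, which is the situation described above.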

In short, message rate is important only when several processes (running
on several cores) are sending or receiving messages at the same time,
i.e. doing tightly synchronous collective operations. Do you see a
problem in my logic?

From personal experience, collective operations are not that
synchronous: there is always a delay between the processes joining the
collective, and that delay is at least larger than the receive gap.
That's why LogP-derived models are not terribly successful at predicting
collective operations. They always add a factor that they call
contention or synchronization noise.
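
Both sides of this argument can be illustrated with a simplified
LogP-style estimate of a binomial-tree broadcast, sketched below in
Python. This is an assumption-laden toy, not a model from either
poster: the tree shape, the way consecutive sends are spaced by
max(o_s, g), the `join` skew vector and all parameter values other than
the 0.3 and 1.0 gaps from the quoted text are hypothetical.

```python
def binomial_bcast_time(nprocs, L, o_s, o_r, g, join=None):
    """Estimated completion time of a binomial-tree broadcast from rank 0.

    L: wire latency, o_s / o_r: send / receive overhead, g: gap.
    join[r]: time at which rank r enters the collective (default 0).
    """
    join = list(join) if join is not None else [0.0] * nprocs
    ready = [0.0] * nprocs      # time at which each rank holds the data
    next_send = [0.0] * nprocs  # earliest start of each rank's next send
    ready[0] = join[0]
    rounds = (nprocs - 1).bit_length()  # ceil(log2(nprocs))
    for k in range(rounds):
        step = 1 << k
        for src in range(step):
            dst = src + step
            if dst >= nprocs:
                continue
            start = max(ready[src], next_send[src])
            # Consecutive sends from one rank are spaced by the larger
            # of the send overhead and the gap.
            next_send[src] = start + max(o_s, g)
            arrival = start + o_s + L
            # A late receiver cannot consume the message before it joins.
            ready[dst] = max(arrival, join[dst]) + o_r
    return max(ready)


if __name__ == "__main__":
    params = dict(nprocs=8, L=0.2, o_s=0.1, o_r=0.1)  # microseconds
    late = [0.0] * 7 + [5.0]  # rank 7 joins 5 us late
    for g in (0.3, 1.0):
        print(f"g={g}: synchronized",
              round(binomial_bcast_time(g=g, **params), 3),
              "skewed",
              round(binomial_bcast_time(g=g, join=late, **params), 3))
```

With a very low latency and perfectly synchronized ranks, the gap
changes the estimate. Once one rank joins late by more than the
difference, both gaps give the same result in this toy model.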

> be able to exploit full bandwidth on a single NIC.  However, strong
> scaling and some of the more advanced codes that don't always operate
> in areas of peak bandwidth can provide enough head room in available
> bandwidth for other cores to use.  Even if 4 or 8 cores
> oversubscribe a single NIC, why not use the cores if it so happens
> that the communication patterns and message sizes still allow you to
> improve your time-to-solution?  After all, time-to-solution is what
> it's all about.  Sure, a second NIC will always help, but getting the
> best performance out of a single NIC and maintaining scalable message
> rates as the number of per-node cores increases is a useful metric.

Sure, no problem here. If you have a lot of cores, and if a lot of
processes are sending or receiving at exactly the same time, and if you
do not oversubscribe the link bandwidth, then message rate is your
metric. My point is that this is a lot of ifs, and other metrics such as
latency apply to a much larger spectrum of communication patterns. I
don't understand why you focus on message rate when latency is really
your strong point.

> be overlapped.  This is feasible, but is tricky and often written
> only by MPI communication experts.  Plus, there's the fact that it's
> in every vendor's interest to optimize the more basic (but less
> exciting) MPI communication primitives to support all the existing
> codes.  

Yes, it's tricky to do overlap right. But if you can extract it from
your code, it has by far the biggest potential for scalability
improvement. By overlap, I don't really mean one-sided; I think that's
too hard to leverage (or I don't have the right brain to use it
correctly). However, split communications with
MPI_Isend/MPI_Irecv/MPI_Wait can take you far if you really can use the
processor efficiently between the initiation and the blocking
completion. Those are still basic primitives. If overlap is not a
requirement, then using the host CPU for everything is just fine.
However, you cannot have both; it's a design choice.
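
The potential gain can be sketched with a back-of-the-envelope model in
Python. It is only an illustration: it assumes the transfer either
progresses entirely without the host CPU or entirely on it, and the
function name and numbers are hypothetical.

```python
def step_time(compute, comm, overhead, overlap):
    """Time for one iteration that computes and exchanges one message.

    compute:  CPU time of the application work
    comm:     time the transfer takes once it has been initiated
    overhead: host CPU time to initiate and complete the transfer
    overlap:  True if the transfer progresses without the host CPU
    """
    if overlap:
        # Initiate (MPI_Isend/MPI_Irecv), compute, then block in MPI_Wait.
        return overhead + max(compute, comm)
    # Host-driven transfer: the CPU cannot compute while it communicates.
    return overhead + compute + comm


if __name__ == "__main__":
    compute, comm, overhead = 10.0, 8.0, 1.0  # microseconds, hypothetical
    print("with overlap   :", step_time(compute, comm, overhead, True))
    print("without overlap:", step_time(compute, comm, overhead, False))
```

The gain in this model depends entirely on having enough useful
computation to place between the initiation and the wait.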

I like this thread, don't you? I wasted tons of precious time, but this
is what I want to see on this list. It's not marketing fluff, even if
half of the recipients may have some trouble following it :-)

Patrick
-- 
Patrick Geoffray
Myricom, Inc.
http://www.myri.com