[Beowulf] Three notes from ISC 2006

Wed Jun 28 08:05:33 PDT 2006

Greg Lindahl wrote:
> On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:
> 
>> I have kept quiet even when you were saying things driven by
>> marketing rather than technical considerations (the packet per
>> second nonsense),
> 
> Patrick, that "packet per second nonsense" is the technical reason our
> interconnect does so well. If you'd like to argue about it,
> technically, I'd be happy to do so. No need to keep quiet.

My reservation was about the way you present it, not the technical idea
behind it. Actually, my real concern was that there was no technical
content in your post, just references to white papers, i.e. marketing fluff.

So, let's finally talk about the technical part. You claim that the key
metric for your product is the messaging rate, i.e. the number of packets
you can send per second. You even have a fancy name for it, something
like Hyper Duper Messaging :-)

From the infamous white papers that I have seen, it looks like you can
send 3 million packets per second from one process for small packets. I
can send about 1 million packets per second on a D card with MX-1.2 (not
560 K as your company claims in the marketing material). I could maybe
double that if I worked on it. Anyway, you say that this is why your
interconnect is so much better.

A high message rate is good, but the question is: how much is enough? At
3 million packets per second, that's about 0.3 us per message, all of
which is spent in the communication library. Can you name real-world
applications that need to send a message every 0.3 us in a sustained
way? I can't; only benchmarks do that. At 1 million packets per second,
that's one message per microsecond. When does the host actually compute
something? Did you measure the effective messaging rates of some
applications?

In your flawed white papers, you compared your own results against
numbers picked from the web, using older interconnects with unknown
software versions. Comparing with Myrinet D cards for example, you have
4 times the link bandwidth and half the latency (actually, more like 1/5
of the latency because I suspect most/all Myrinet results were using GM
and not MX), but you say that it's the messaging rate that drives the
performance??? I would suspect the latency in most cases, but you
certainly can't say unless you look at it.
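
To make the arithmetic above concrete, here is a minimal Python sketch
that converts a message rate into the average time available per
message. The function name is made up for illustration; the two rates
are the ones quoted above.

```python
def time_per_message_us(messages_per_second):
    """Return the average interval between messages, in microseconds."""
    return 1e6 / messages_per_second

# Rates quoted in this thread: 3 million and 1 million packets per second.
for rate in (3_000_000, 1_000_000):
    print(f"{rate:>9} msg/s -> {time_per_message_us(rate):.2f} us per message")
```

Whatever the host spends computing between two sends has to fit inside
that interval if the rate is to be sustained.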

So, the only two cases where a high message rate makes some sense are:

* personalized all-to-all: an application may need to send small packets
to many destinations after a computing phase. In this case, you want to
send them as fast as possible, obviously. But this burst of
communication is limited in size. The worst case is a naive personalized
all-to-all, i.e. one message per peer. How many messages is that, and
how often? Does that add up to 3 million packets per second? I don't
think so. You also have to receive from the peers, and the skew between
processes is likely to cost more than the time to send all of your
messages. My assertion is that 1 million packets per second is good
enough; I have never seen this metric be a bottleneck in any
application profiling I have done. Apparently, neither have the other
interconnect vendors (I don't expect Mellanox to look at these things,
but the Quadrics people are no beginners). With an interconnect doing
real offload, you will actually queue the messages and the NIC will
process them. The message rate for a limited burst is then really the
memory copy performance, as the MPI_Send for a small/medium message will
return just after the data is copied out of the application send buffer.
The time it takes for the NIC to process the sends asynchronously is not
zero, but it is usually not the bottleneck either, compared to the
synchronization overhead of the all-to-all.
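
As a rough illustration of the "how many messages is that" question,
the sketch below counts the sends in a naive personalized all-to-all
(one message per peer) and the time needed to inject them at a given
rate. The job size of 1024 processes is a hypothetical figure, not
something from this thread, and the model deliberately ignores the
receive side and the skew between processes.

```python
def alltoall_burst(num_procs, msgs_per_sec):
    """Naive personalized all-to-all: each process sends one message per peer.

    Returns (messages sent by one process, microseconds needed to inject
    them at the given send rate). Receives and process skew are ignored.
    """
    msgs = num_procs - 1
    burst_us = msgs / msgs_per_sec * 1e6
    return msgs, burst_us

NUM_PROCS = 1024  # hypothetical job size, chosen only for illustration
for rate in (1_000_000, 3_000_000):
    msgs, burst_us = alltoall_burst(NUM_PROCS, rate)
    print(f"{msgs} msgs at {rate} msg/s: {burst_us:.0f} us per all-to-all")
```

Under these assumptions, a process would need to run close to a
thousand such all-to-alls every second, with no computation in between,
just to average 1 million messages per second.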

* high ratio of processes per NIC: that's actually the only good
argument. If you cannot increase the total message rate when you add
processes, then your message rate per process decreases. Again, your
infamous marketing material is wrong: the bulk of the send overhead is
not in the NIC. Actually, the latency is wrong too (the sources cited
are a talk from 2003, newsgroups and Pathscale estimations ?!?). For
various reasons (but not for any message rate consideration), we have
moved the reliability support from the NIC to the host, so the NIC
overhead is now much lower than the host overhead, which is dominated by
the memory copy for small messages. If you had access to the source of
the MX-1.2 library, you would have seen it (you will when it is
available under the widely known ftp.myri.com password; we always ship
the source of our libraries, not like other vendors :-o ).
So, the message rate does increase with the number of processes with
MX-1.2, but it will still be bounded by the NIC overhead, which is
likely larger with Myrinet (2G or 10G) than with your NIC.
However, this is the same question: how much is enough per process?
Sharing one NIC among 8 cores is, IMHO, a bad idea. A ratio of one NIC
for 2 cores was quite standard over the last decade. It may make
economic sense to increase this ratio to one NIC for 4 cores, but I
would not recommend going higher than that. And with the number of cores
going up (if people actually buy many-core configurations; the sweet
spot is definitely not at 8-way), it will make a lot of sense to use
hybrid shared-memory/interconnect for collective communications. In this
context, the message rate requirement of an all-to-all is not shared
among processes.
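
The processes-per-NIC argument can be sketched the same way. The code
below assumes the simplest possible model, a fixed NIC-wide message
rate ceiling split evenly among the processes sharing the card. Both
the function name and the 2 million msg/s ceiling are hypothetical and
are not measurements of any particular interconnect.

```python
def per_process_rate(nic_ceiling_msgs_per_sec, procs_per_nic):
    """Message rate left for each process if the NIC ceiling is split evenly."""
    if procs_per_nic < 1:
        raise ValueError("procs_per_nic must be at least 1")
    return nic_ceiling_msgs_per_sec / procs_per_nic

NIC_CEILING = 2_000_000  # hypothetical NIC-wide ceiling, in messages per second
for procs in (2, 4, 8):
    rate = per_process_rate(NIC_CEILING, procs)
    print(f"{procs} processes per NIC: {rate:,.0f} msg/s each")
```

The question remains whether the per-process figure that comes out is
already more than an application needs.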

Finally, you don't talk much about the side effects of your
architectural decisions, such as little/no overlap and high CPU overhead.

The white papers are right about one thing: latency and bandwidth are
not enough to fully describe an interconnect. But message rate is just
one of the metrics, and I assert that it's not a particularly important
one. I suspect that Pathscale picked message rate as a marketing drum
because no other interconnect vendor really cared about it. That was the
differentiation bullet from the business workshop I attended.

Patrick