[Beowulf] Three notes from ISC 2006
patrick at myri.com
Wed Jun 28 08:05:33 PDT 2006
Greg Lindahl wrote:
> On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:
>> I have keep it quiet even when you where saying things driven by
>> marketing rather than technical considerations (the packet per
>> second nonsense),
> Patrick, that "packet per second nonsense" is the technical reason our
> interconnect does so well. If you'd like to argue about it,
> technically, I'd be happy to do so. No need to keep quiet.
My reservation was about the way you present it, not the technical idea
behind it. Actually, my real concern was that there was no technical
content in your post, just references to white papers, i.e. marketing fluff.
So, let's finally talk about the technical part. You claim that the key
metric in your product is the messaging rate, i.e. the number of packets
you can send per second. You even have a fancy name for it, something
like Hyper Duper Messaging :-)
From the infamous white papers that I have seen, it looks like you can
send 3 million packets per second from one process for small packets. I
can send about 1 million packets per second on a D card with MX-1.2 (not
560 K as your company claims in the marketing material). I could maybe
double that if I worked on it. Anyway, you say that this is why your
interconnect is so much better.
A high message rate is good, but the question is how much is enough? At 3
million packets per second, that's about 0.3 us per message, all of which
is used by the communication library. Can you name real-world applications
that need to send a message every 0.3 us in a sustained way? I can't,
only benchmarks do that. At 1 million packets per second, that's one
message per microsecond. When does the host actually compute something?
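
As a back-of-the-envelope check of that arithmetic, here is a small sketch
that converts a sustained message rate into the time budget per message.
The function name is made up for illustration; the two rates are the ones
quoted above.

```python
def seconds_per_message(messages_per_second):
    """Average time available per message at a sustained message rate."""
    return 1.0 / messages_per_second


# Rates quoted in this thread: 3 million and 1 million packets per second.
for rate in (3_000_000, 1_000_000):
    interval_us = seconds_per_message(rate) * 1e6  # seconds -> microseconds
    print(f"{rate:>9} msg/s -> {interval_us:.2f} us per message")
```

Whatever the host spends computing between two sends comes on top of that
interval, so a real application's sustained rate will be lower.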
Did you measure the effective messaging rates of some applications?
In your flawed white papers, you compared your own results against
numbers picked from the web, using older interconnects with unknown
software versions. Compared with Myrinet D cards, for example, you have
4 times the link bandwidth and half the latency (actually, more like 1/5
of the latency because I suspect most/all Myrinet results were using GM
and not MX), but you say that it's the messaging rate that drives the
performance??? I would suspect the latency in most cases, but you
certainly can't say unless you look at it.
So, the only two cases where a high message rate makes some sense are:
* personalized all-to-all: an application may need to send small packets
to many destinations after a computing phase. In this case, you want to
send them as fast as possible, obviously. But this burst of
communication is limited in size. The worst case is a naive personalized
all-to-all, i.e. one message per peer. How many messages is that, and how
often? Does that add up to 3 million packets per second? I don't
think so. You also have to receive from the peers, and the skew between
processes is likely to cost more than the time to send all of your
messages. My assertion is that 1 million packets per second is good
enough: I have never seen this metric be a bottleneck in any
application profiling I have done. Apparently, neither have the other
interconnect vendors (I don't expect Mellanox to look at these things,
but the Quadrics people are no beginners). With interconnects doing real
offload, you will actually queue the messages and the NIC will process
them. The message rate for a limited burst is then really the memory copy
performance, as the MPI_Send for a small/medium message will return just
after the data is copied out of the application send buffer. The time it
takes for the NIC to process the sends asynchronously is not zero, but
it is usually not the bottleneck either, compared to the synchronization
overhead of the all-to-all.
* high ratio of processes per NIC: that's actually the only good
argument. If you cannot increase the total message rate when you add
processes, then your message rate per process decreases. Again, your
infamous marketing material is wrong: the bulk of the send overhead is
not in the NIC. Actually, the latency is wrong too (the sources cited are
a talk from 2003, newsgroups and Pathscale estimations?!?). For various
reasons (but not for any message rate consideration), we have moved the
reliability support from the NIC to the host, so the NIC overhead is now
much lower than the host overhead, which is dominated by the memory copy
for small messages. If you had access to the source of the MX-1.2
library, you would have seen it (you will when it is available under the
widely known ftp.myri.com password; we always ship the source of our
libraries, not like other vendors :-o ).
So, the message rate does increase with the number of processes with
MX-1.2, but it will still be bounded by the NIC overhead, which is
likely larger with Myrinet (2G or 10G) than with your NIC.
However, this is the same question: how much is enough per process?
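
A simple way to picture this is a model where the total rate grows with
the number of processes until the NIC saturates. This is only a sketch:
the linear-then-capped model is an assumption, the per-process host rate
reuses the roughly 1 million per second figure mentioned earlier, and the
NIC limit of 2.5 million per second is a hypothetical number picked for
illustration.

```python
def per_process_rate(num_processes, host_rate, nic_rate):
    """Message rate seen by each process sharing one NIC.

    Assumed model: each process can generate host_rate messages per
    second on its own, and the NIC caps the total at nic_rate.
    """
    total = min(num_processes * host_rate, nic_rate)
    return total / num_processes


HOST_RATE = 1_000_000  # per-process rate, from the figure quoted earlier
NIC_RATE = 2_500_000   # hypothetical NIC-side limit, not a measured value

for n in (1, 2, 4, 8):
    rate = per_process_rate(n, HOST_RATE, NIC_RATE)
    print(f"{n} processes per NIC: {rate:,.0f} msg/s each")
```

In this model the per-process rate stays flat until the NIC limit is hit
and then drops as 1/n, which is why the ratio of cores per NIC matters.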
Sharing one NIC among 8 cores is, IMHO, a bad idea. A ratio of one NIC for
2 cores was quite standard over the last decade. It may make economic sense
to increase this ratio to one NIC for 4 cores, but I would not recommend
going higher than that. And with the number of cores going up (if people
actually buy many-core configurations; the sweet spot is definitely
not at 8-way), it will make a lot of sense to use hybrid
shared-memory/interconnect for collective communications. In this
context, the message rate requirement of an all-to-all is not shared
Finally, you don't talk much about the side effects of your
architectural decisions, such as little or no overlap and high CPU overhead.
The white papers are right on one thing: latency and bandwidth are not
enough to fully describe an interconnect. But message rate is just one
of the metrics, and I assert that it's not a particularly important one.
I suspect that Pathscale picked message rate as a marketing drum because
no other interconnects really cared about it. That was the
differentiation bullet from the business workshop I attended.