[Beowulf] Three notes from ISC 2006
see_signature_for_reply-to at ccrl-nece.de
Wed Jun 28 09:59:56 PDT 2006
Patrick Geoffray wrote:
> Greg Lindahl wrote:
>> On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:
>>> I have kept it quiet even when you were saying things driven by
>>> marketing rather than technical considerations (the packet per
>>> second nonsense),
>> Patrick, that "packet per second nonsense" is the technical reason our
>> interconnect does so well. If you'd like to argue about it,
>> technically, I'd be happy to do so. No need to keep quiet.
> My reservation was about the way you present it, not the technical idea
> behind it. Actually, my real concern was that there was no technical
> content in your post, just references to white papers, i.e. marketing fluff.
An offer for "getting a secret white paper on request" is marketing, you are
right. But at least the SPEC number was technical content - and we don't want to
analyse every posting sentence-by-sentence, do we?
> So, let's finally talk about the technical part. You claim that the key
> metric in your product is the messaging rate, i.e. the number of packets
> you can send per second. You even have a fancy name for it, something
> like Hyper Duper Messaging :-)
Let me summarize what I consider the key issues:
- Explicit MPI_Irecv/MPI_Send/MPI_Wait, or similar patterns implicit in
MPI_Reduce/MPI_Alltoall/MPI_Allreduce, with small messages (a few doubles, or a
few kB) are the dominant communication pattern in many MPI applications. There
are quite a few (but not as many as one could wish) studies that show this.
- This means it's generally a good thing if the "ping" latency (duration of
MPI_Send in number of CPU cycles) is as low as possible.
- At this message size, CPU utilization or overlapping computation and
communication is not relevant, as (zero-copy) RDMA does not pay off until the
message reaches at least some tens of kB in size (typically >32 kB, or more),
due to the implied pinning and rendezvous overhead. Also, MPI_Send has no
opportunity for overlap, and having a progress thread on the receive CPU steal
cycles from the application doesn't really help, either.
- In these cases, all(?) interconnects do some sort of memcpy() within MPI_Send
to get rid of the data. The differences are
* How long does it take to prepare things for the memcpy()? This is Greg's
* When does the data arrive at the destination?
- But you never want to send millions of messages at once. This is
micro-benchmarking at its best. It gives some indications, but seen alone, it is
no proof of anything.
- *If* you feel you need to use such a new metric for whatever reason, you
should at least publish the benchmark that is used to gather these numbers to
allow others to do comparative measurements. This goes to Greg.
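
To make the RDMA break-even argument from the list above a bit more concrete,
here is a toy cost model of the sender side of MPI_Send. It is only a sketch:
the constants (preparation time, memcpy() bandwidth, pinning plus rendezvous
setup time) and the function names are hypothetical values picked for
illustration, not measurements of any interconnect or MPI library.

```python
# Toy cost model for the sender side of MPI_Send. All numbers are
# illustrative assumptions, not measurements of any real interconnect.

PREP_NS = 1_000               # assumed fixed cost before the eager memcpy()
COPY_BYTES_PER_NS = 2         # assumed memcpy() bandwidth (2 bytes/ns = 2 GB/s)
RENDEZVOUS_SETUP_NS = 17_000  # assumed pinning + rendezvous handshake cost


def eager_send_ns(nbytes):
    """Eager path: fixed preparation plus a copy into a send buffer."""
    return PREP_NS + nbytes / COPY_BYTES_PER_NS


def rendezvous_send_ns(nbytes):
    """Zero-copy path: setup cost only; the sender copies nothing."""
    return RENDEZVOUS_SETUP_NS


def crossover_bytes():
    """Message size at which both paths cost the sender the same."""
    return (RENDEZVOUS_SETUP_NS - PREP_NS) * COPY_BYTES_PER_NS


if __name__ == "__main__":
    for n in (64, 4096, 65536):
        print(n, eager_send_ns(n), rendezvous_send_ns(n))
    print("break-even at", crossover_bytes(), "bytes")
```

With these assumed constants, small messages are far cheaper on the eager
(memcpy) path, and the zero-copy path only wins once the message is a few tens
of kB. The real break-even point depends on the actual pinning and handshake
costs of a given system.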
But I don't think that Greg's "Real Application Performance" white paper is
infamous. It states where the data comes from (you have to trust him for his
own numbers), and it does not directly link the differences in application
performance to the messaging rate. Of course, it does not offer a scientific
analysis, and you cannot compare it to papers like the ones from Leonid Oliker.
But I don't think it's unfair, and it surely stimulates the competition for
better technical solutions or better white papers.
Joachim - reply to joachim at domain ccrl-nece dot de
Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.