[Beowulf] Three notes from ISC 2006

Joachim Worringen see_signature_for_reply-to at ccrl-nece.de
Wed Jun 28 09:59:56 PDT 2006

Patrick Geoffray wrote:
> Greg Lindahl wrote:
>> On Wed, Jun 28, 2006 at 07:28:53AM -0400, Patrick Geoffray wrote:
>>> I have kept quiet even when you were saying things driven by
>>> marketing rather than technical considerations (the packet per
>>> second nonsense),
>> Patrick, that "packet per second nonsense" is the technical reason our
>> interconnect does so well. If you'd like to argue about it,
>> technically, I'd be happy to do so. No need to keep quiet.
> My reservation was about the way you present it, not the technical idea 
> behind it. Actually, my real concern was that there was no technical 
> content in your post, just references to white papers, i.e. marketing fluff.

An offer for "getting a secret white paper on request" is marketing, you are 
right. But at least the SPEC number was technical content - and we don't want to 
analyse every posting sentence-by-sentence, do we?

> So, let's finally talk about the technical part. You claim that the key 
> metric in your product is the messaging rate, i.e. the number of packets 
> you can send per second. You even have a fancy name for it, something 
> like Hyper Duper Messaging :-)

Let me summarize what I consider the key issues:
- Explicit MPI_Irecv/MPI_Send/MPI_Wait, or similar patterns implicitly in 
MPI_Reduce/MPI_Alltoall/MPI_Allreduce, with small messages (a few doubles, or a 
few kB) are the dominant communication pattern in many MPI applications. There 
are quite a few (but not as many as one could wish) studies that show this.
- This means it's generally a good thing if the "ping" latency (duration of 
MPI_Send in number of CPU cycles) is as low as possible.
- At this message size, CPU utilization or overlapping computation and 
communication is not relevant, as (zero-copy) RDMA does not pay off until the 
message reaches a certain size (typically more than 32 kB), due to the 
implied pinning and rendezvous overhead. Also, MPI_Send has no opportunity for 
overlap, and having a progress thread on the receive CPU steal cycles from the 
application doesn't really help, either.
- In these cases, all(?) interconnects do some sort of memcpy() within MPI_Send 
to get rid of the data. The differences are:
  * How long does it take to prepare things for the memcpy()? This is Greg's 
message rate.
  * When does the data arrive at the destination?
- But you never want to send millions of messages at once. This is 
micro-benchmarking at its best. It gives some indication, but seen alone, it is 
no proof of anything.
- *If* you feel you need to use such a new metric for whatever reason, you 
should at least publish the benchmark that is used to gather these numbers to 
allow others to do comparative measurements. This goes to Greg.
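
To make the trade-off in the points above concrete, here is a minimal cost-model 
sketch in C. It is only an illustration, not a benchmark and not any vendor's 
method: the struct send_model, its fields, and all four functions are 
hypothetical names, and every parameter is a model input to be filled in with 
your own measurements. It assumes the copy path costs a fixed setup time plus 
the memcpy() time, that the zero-copy path costs a fixed pinning plus rendezvous 
overhead, and that the wire time is the same for both and can be left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model inputs; none of these are measured values. */
typedef struct {
    double eager_setup_us;     /* time to prepare the memcpy() inside MPI_Send */
    double copy_bytes_per_us;  /* memcpy() bandwidth in bytes per microsecond */
    double pin_us;             /* cost of pinning the buffer for zero-copy RDMA */
    double rendezvous_us;      /* cost of the rendezvous handshake */
} send_model;

/* CPU time spent in MPI_Send on the copy-based path for a message of `bytes`. */
double eager_cost_us(const send_model *m, size_t bytes)
{
    assert(m->copy_bytes_per_us > 0.0);
    return m->eager_setup_us + (double)bytes / m->copy_bytes_per_us;
}

/* Fixed overhead of the zero-copy path (wire time assumed equal, so omitted). */
double zero_copy_cost_us(const send_model *m)
{
    return m->pin_us + m->rendezvous_us;
}

/* Message size in bytes at which both paths cost the same in this model.
 * Below this size the copy path is cheaper. Returns 0 if zero-copy has no
 * extra fixed overhead at all. */
double crossover_bytes(const send_model *m)
{
    double extra_fixed_us = zero_copy_cost_us(m) - m->eager_setup_us;
    return extra_fixed_us > 0.0 ? extra_fixed_us * m->copy_bytes_per_us : 0.0;
}

/* Upper bound on small-message sends per second if only the per-message
 * setup time limits the sender (one reading of "message rate"). */
double max_message_rate_per_s(const send_model *m)
{
    assert(m->eager_setup_us > 0.0);
    return 1e6 / m->eager_setup_us;
}
```

With made-up inputs of 1 us setup, 2000 bytes/us copy bandwidth, and 10 us each 
for pinning and rendezvous, crossover_bytes() gives 38000 bytes and 
max_message_rate_per_s() gives one million messages per second. The model also 
shows why the rate alone says little: it depends only on the setup time, while 
the second difference listed above, when the data arrives at the destination, 
is not captured at all.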

But I don't think that Greg's "Real Application Performance" white paper is 
infamous.  It states where the data comes from, you have to trust him for his 
own numbers, and it does not directly link the differences in the application 
performance to the messaging rate. Of course, it does not offer a scientific 
analysis, and you cannot compare it to papers like the ones from Leonid Oliker. 
But I don't think it's unfair, and surely stimulates the competition for better 
technical solutions or better white papers.

Joachim - reply to joachim at domain ccrl-nece dot de

Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.
