[Beowulf] Q: IB message rate & large core counts (per node)?

Patrick Geoffray patrick at myri.com
Mon Mar 15 13:27:23 PDT 2010


Hi Richard,

I meant to reply earlier but got busy.

On 2/27/2010 11:17 PM, richard.walsh at comcast.net wrote:
> If anyone finds errors in it please let me know so that I can fix
> them.

You don't take protocol efficiency into account, and it is a major issue on 
PCIe.

First of all, I would change the labels "Raw" and "Effective" to 
"Signal" and "Raw". Then, I would add a third column, "Effective", which 
considers the protocol overhead. The protocol overhead is the amount of 
raw bandwidth that is not used for useful payload. On PCIe, on the Read 
side, the data comes back in small packets with a 20-byte header (24 
with the optional ECRC) for a 64-, 128- or 256-byte payload. Most PCIe 
chipsets only support a 64-byte Read Completion MTU, and even the ones 
that support larger sizes still use a majority of 64-byte completions 
because that maps well to the transaction size on the memory bus (HT, 
QPI). With 64-byte Read Completions, the PCIe efficiency is 64/84 = 76%, 
so 32 Gb/s becomes 24 Gb/s, which corresponds to the hero number quoted 
by MVAPICH for example (3 GB/s unidirectional). Bidirectional efficiency 
is a bit worse because PCIe Acks take some raw bandwidth too. They are 
coalesced, but the pipeline is not very deep, so you end up with roughly 
20+20 Gb/s bidirectional.
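To make the arithmetic concrete, here is a quick sketch of the same 
calculation in Python (the 32 Gb/s raw figure and the 20-byte completion 
header are the numbers above; the rest is just illustration):

# Sketch of the PCIe Read Completion efficiency math described above,
# assuming 32 Gb/s of raw link bandwidth and a 20-byte completion
# header (24 bytes with the optional ECRC).

RAW_GBPS = 32.0
HEADER_BYTES = 20

def pcie_read_efficiency(payload_bytes, header_bytes=HEADER_BYTES):
    """Fraction of raw bandwidth carrying useful payload."""
    return payload_bytes / (payload_bytes + header_bytes)

for payload in (64, 128, 256):
    eff = pcie_read_efficiency(payload)
    print(f"{payload:>3}B completions: {eff:5.1%} -> "
          f"{RAW_GBPS * eff:.1f} Gb/s effective")

# 64-byte completions come out at ~76%, i.e. ~24 Gb/s (~3 GB/s) effective.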

There is a similar protocol overhead at the IB or Ethernet level, but 
the MTU is large enough that it is much smaller than on PCIe.
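
The same formula shows why: with a 2 KB or 4 KB MTU the per-packet 
overhead (assumed to be roughly 30 bytes below, purely for illustration) 
costs only 1-2%, versus ~24% for 64-byte PCIe Read Completions.

def wire_efficiency(mtu_bytes, overhead_bytes=30):
    # overhead_bytes is an illustrative per-packet header/CRC figure,
    # not an exact IB or Ethernet number
    return mtu_bytes / (mtu_bytes + overhead_bytes)

for mtu in (2048, 4096):
    print(f"MTU {mtu}: {wire_efficiency(mtu):.1%} efficient")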

Now, all of this does not matter because Marketers will keep using 
useless Signal rates. They will even have the balls to (try to) rewrite 
history about packet rate benchmarks...

Patrick


