[Beowulf] Q: IB message rate & large core counts (per node)?

richard.walsh at comcast.net
Mon Mar 15 14:24:52 PDT 2010

On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: 

>I meant to respond to this, but got busy. You don't consider the protocol 
>efficiency, and this is a major issue on PCIe. 

Yes, I forgot that there is more to the protocol than the 8B/10B encoding, 
but I am glad to get your input to improve the table (late or otherwise). 

>First of all, I would change the labels "Raw" and "Effective" to 
>"Signal" and "Raw". Then, I would add a third column "Effective" which 
>consider the protocol overhead. The protocol overhead is the amount of 

I think adding another column for protocol inefficiency makes 
some sense. I am not sure I know enough to choose the right protocol performance 
loss multipliers, or what the common-case values would be (as opposed 
to best and worst case). It would be good to add Ethernet to the mix 
(1Gb, 10Gb, and 40Gb) as well. It sounds like the 76% multiplier is 
reasonable for PCI-E (with a "your mileage may vary" footnote). The table 
cannot perfectly reflect every contributing variable without getting very large. 
Perhaps you could provide a table with the Ethernet numbers, and I will do 
some more research to make estimates for IB? Then I will get a draft to Doug 
at Cluster Monkey. One more iteration only ... to improve things, but avoid 
a "protocol holy war" ... ;-) ... 

>raw bandwidth that is not used for useful payload. On PCIe, on the Read 
>side, the data comes in small packets with a 20 Bytes header (could be 
>24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe 
>chipsets only support 64 Bytes Read Completions MTU, and even the ones 
>that support larger sizes would still use a majority of 64 Bytes 
>completions because it maps well to the transaction size on the memory 
>bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 
>64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero 
>number quoted by MVAPICH for example (3 GB/s unidirectional). 
>Bidirectional efficiency is a bit worse because PCIe Acks take some raw 
>bandwidth too. They are coalesced but the pipeline is not very deep, so 
>you end up with roughly 20+20 Gb/s bidirectional. 

Thanks for the clear and detailed description. 
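
To check that I follow the arithmetic above, here is a minimal sketch of 
the Signal -> Raw -> Effective chain. The 20 and 24 byte headers and the 
64/128/256 byte payloads are from your description; the 40 Gb/s signal 
rate (8 lanes at 5 GT/s) is my assumption for how the 32 Gb/s raw figure 
arises, and the function names are only illustrative. 

```python
def payload_efficiency(payload_bytes, header_bytes):
    """Fraction of raw bandwidth that carries useful payload."""
    return payload_bytes / (payload_bytes + header_bytes)


def effective_gbps(raw_gbps, payload_bytes, header_bytes):
    """Raw rate scaled by the per-packet payload efficiency."""
    return raw_gbps * payload_efficiency(payload_bytes, header_bytes)


# Assumption: 8 lanes at 5 GT/s each gives a 40 Gb/s signal rate.
SIGNAL_GBPS = 8 * 5.0
# 8B/10B encoding: 8 data bits are carried in every 10 signal bits.
RAW_GBPS = SIGNAL_GBPS * 8 / 10

# Header is 20 bytes, or 24 bytes with the optional ECRC (from the text above).
for header in (20, 24):
    for payload in (64, 128, 256):
        eff = payload_efficiency(payload, header)
        rate = effective_gbps(RAW_GBPS, payload, header)
        print(f"header={header:2d}B payload={payload:3d}B "
              f"efficiency={eff:6.1%} effective={rate:5.2f} Gb/s")
```

With a 64 byte payload and a 20 byte header this gives 64/84, or about 
76%, of the 32 Gb/s raw rate, which matches the roughly 24 Gb/s quoted 
above. It models the unidirectional Read Completion case only; the Ack 
overhead in the bidirectional case is not included. 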

>There is a similar protocol efficiency at the IB or Ethernet level, but 
>the MTU is large enough that it's much smaller compared to PCIe. 

Would you estimate the loss at less than 1%, 2%, 4% ... ?? 
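
As a rough first guess on my side, the same payload/(payload + overhead) 
ratio can be applied to Ethernet. The sketch below assumes standard 
Ethernet framing (preamble, start delimiter, MAC header, FCS, and 
inter-frame gap, 38 bytes in total) and treats the whole MTU as payload, 
so IP and transport headers are ignored. These framing numbers are my 
assumption and not from this thread, and I have left IB out until I have 
checked its header sizes. 

```python
def wire_efficiency(payload_bytes, overhead_bytes):
    """Fraction of the wire rate left for payload after per-frame overhead."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# Assumed per-frame Ethernet overhead in bytes (not taken from the thread):
# preamble 7 + start delimiter 1 + MAC header 14 + FCS 4 + inter-frame gap 12.
ETH_OVERHEAD = 7 + 1 + 14 + 4 + 12

# Standard MTU and a common jumbo-frame MTU, with the MTU counted as payload.
for mtu in (1500, 9000):
    loss = 1.0 - wire_efficiency(mtu, ETH_OVERHEAD)
    print(f"Ethernet MTU {mtu:4d}B: protocol loss about {loss:.2%}")

# For comparison, the 64 byte PCIe Read Completion with a 20 byte header.
pcie_loss = 1.0 - wire_efficiency(64, 20)
print(f"PCIe 64B completion: protocol loss about {pcie_loss:.2%}")
```

Under those assumptions the loss at this layer is a few percent or less, 
against roughly a quarter for 64 byte PCIe completions, which is at least 
consistent with the point that the larger MTU makes it much smaller. 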

>Now, all of this does not matter because Marketers will keep using 
>useless Signal rates. They will even have the balls to (try to) rewrite 
>history about packet rate benchmarks... 

I am hoping the table increases the number of fully informed decisions on 
these questions. 
