[Beowulf] Q: IB message rate & large core counts (per node)?

Gilad Shainer Shainer at mellanox.com
Mon Mar 15 14:33:07 PDT 2010

To make it more accurate, most PCIe chipsets support 256B reads, and the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20. 
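The header-per-completion model discussed further down the thread can be sketched in a few lines. This is a back-of-the-envelope illustration only: it uses the 20-byte per-packet overhead Patrick quotes below (no optional ECRC) and a PCIe Gen2 x8 raw rate of 32 Gb/s, and it ignores Ack/DLLP traffic, which is why the full-duplex numbers quoted in the thread come out lower still.

```python
# Back-of-the-envelope PCIe Read Completion efficiency, per the
# 20-byte header figure quoted in this thread (no optional ECRC).
# Ack/DLLP overhead is deliberately ignored here.

PCIE_HEADER_BYTES = 20

def completion_efficiency(payload_bytes: int) -> float:
    """Fraction of raw link bandwidth carrying payload, for one
    Read Completion of the given payload size."""
    return payload_bytes / (payload_bytes + PCIE_HEADER_BYTES)

if __name__ == "__main__":
    raw_gbps = 32.0  # Gen2 x8 after 8b/10b: 40 Gb/s signal -> 32 Gb/s raw
    for mtu in (64, 128, 256):
        eff = completion_efficiency(mtu)
        print(f"{mtu:>3}B completions: {eff:5.1%} -> {raw_gbps * eff:4.1f} Gb/s")
```

With 64B completions this reproduces the 64/84 = 76% (about 24 Gb/s) figure quoted below; with 256B completions the header-only efficiency is higher, so the 26 Gb/s figure above presumably folds in additional overheads this sketch omits.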





From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of richard.walsh at comcast.net
Sent: Monday, March 15, 2010 2:25 PM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)?



On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: 


>I meant to respond to this, but got busy. You don't consider the protocol
>efficiency, and this is a major issue on PCIe.


Yes, I forgot that there is more to the protocol than the 8B/10B encoding,
but I am glad to get your input to improve the table (late or otherwise).

>First of all, I would change the labels "Raw" and "Effective" to 
>"Signal" and "Raw". Then, I would add a third column "Effective" which 
>consider the protocol overhead. The protocol overhead is the amount of 


I think adding another column for protocol inefficiency makes some sense.
I'm not sure I know enough to choose the right protocol performance-loss
multipliers, or what the common-case values would be (as opposed to best
and worst case).  It would be good to add Ethernet to the mix (1Gb, 10Gb,
and 40Gb) as well.  The 76% multiplier sounds reasonable for PCI-E (with
a "your mileage may vary" footnote).  The table cannot perfectly reflect
every contributing variable without getting very large.

Perhaps you could provide a table with the Ethernet numbers, and I will do
some more research to make estimates for IB?  Then I will get a draft to Doug
at Cluster Monkey.  One more iteration only ... to improve things, but avoid
a "protocol holy war" ... ;-) ... 


>raw bandwidth that is not used for useful payload. On PCIe, on the Read 
>side, the data comes in small packets with a 20 Bytes header (could be 
>24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe 
>chipsets only support 64 Bytes Read Completions MTU, and even the ones 
>that support larger sizes would still use a majority of 64 Bytes 
>completions because it maps well to the transaction size on the memory 
>bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 
>64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which corresponds to the hero 
>number quoted by MVAPICH for example (3 GB/s unidirectional). 
>Bidirectional efficiency is a bit worse because PCIe Acks take some raw 
>bandwidth too. They are coalesced but the pipeline is not very deep, so 
>you end up with roughly 20+20 Gb/s bidirectional.


Thanks for the clear and detailed description.

>There is a similar protocol efficiency at the IB or Ethernet level, but 
>the MTU is large enough that it's much smaller compared to PCIe.


Would you estimate it at less than 1%, 2%, or 4% ... ??
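For scale, a rough per-packet framing sketch puts the wire-level overhead in exactly that low single-digit range. The header sizes here are my own assumptions, not figures from this thread: standard Ethernet framing (preamble+SFD 8, MAC header 14, FCS 4, inter-frame gap 12 bytes) and local (non-routed) InfiniBand headers (LRH 8 + BTH 12 + ICRC 4 + VCRC 2 bytes).

```python
# Rough per-packet framing overhead at the wire, to put numbers on
# "much smaller compared to PCIe".  Header sizes are assumptions:
# Ethernet: preamble+SFD 8 + MAC header 14 + FCS 4 + IFG 12 = 38B
# InfiniBand (local route): LRH 8 + BTH 12 + ICRC 4 + VCRC 2 = 26B

def overhead_pct(mtu_bytes: int, header_bytes: int) -> float:
    """Percentage of wire bandwidth consumed by framing overhead."""
    return 100.0 * header_bytes / (mtu_bytes + header_bytes)

ethernet = overhead_pct(1500, 8 + 14 + 4 + 12)
infiniband = overhead_pct(2048, 8 + 12 + 4 + 2)
print(f"Ethernet 1500B MTU:   {ethernet:.1f}% overhead")
print(f"InfiniBand 2048B MTU: {infiniband:.1f}% overhead")
```

Under those assumptions Ethernet loses roughly 2.5% and IB roughly 1.3% to framing, versus the ~24% PCIe loses with 64B Read Completions, which is consistent with "much smaller compared to PCIe".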

>Now, all of this does not matter because Marketers will keep using 
>useless Signal rates. They will even have the balls to (try to) rewrite 
>history about packet rate benchmarks...

I am hoping the table increases the number of fully informed decisions on
these questions.


Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
