[Beowulf] Performance characterising a HPC application

Fri Mar 23 11:53:14 PDT 2007

Patrick, 

> -----Original Message-----
> From: Patrick Geoffray [mailto:patrick at myri.com] 
> Sent: Thursday, March 22, 2007 11:28 PM
> To: Gilad Shainer
> Cc: beowulf at beowulf.org
> Subject: Re: [Beowulf] Performance characterising a HPC application
> 
> Gilad,
> 
> Gilad Shainer wrote:
> >> -----Original Message-----
> >> People doing their homework are still buying more 2G than 10G today,
> >> because of better price/performance for their codes (and thin
> >> cables).
> 
> > People doing their homework test their applications and decide
> 
> That's what I have always said.

So we agree :-)

> 
> > For that purpose. In your case, maybe your customers prefer your 2G 
> > over your 10G, but I am really not sure if most of the HPC users are 
> > buying more 2G rather than other faster I/O solutions.....
> 
> My feedback is limited to Myrinet 2G vs Myrinet 10G when 
> customers actually run their codes and the performance gain 
> is either null or too small to be worth the price difference 
> (this criteria is of course very subjective). I don't know if 
> they test on "faster" solutions as well.

It is a subjective view and not a market statement. I am seeing the
opposite. 

> 
> > All the real applications performance that I saw show that IB 10 and 
> > 20Gb/s provide much higher performance results comparing to
> 
> I think you mean IB 8 Gb/s and 16 Gb/s, since using the 
> signal rate instead of the data rate is not only confusing, 
> it is wrong and nobody else does it.
> 
> Furthermore, what you really mean is 8 Gb/s and 13.7 Gb/s 
> since this is the maximum throughput of a PCI Express 8x link (*).

So now we can discuss technical terms and not marketing terms such
as price/performance. InfiniBand uses 10Gb/s and 20Gb/s link signaling
rates. The coding of the data into the link signaling is 8/10. When
someone refers to 10 and 20Gb/s, it is the link speed, and there
is nothing confusing here - this is the InfiniBand specification (and a
standard if I may say).
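
To make the signal rate vs. data rate distinction concrete, here is a minimal sketch of the 8/10 arithmetic being argued over. The function name and the link labels are only illustrative, and the lane counts (4x for InfiniBand, 2.5 Gb/s per lane for a PCIe x8 link) are assumptions that are not spelled out in this thread.

```python
def data_rate_gbps(signal_rate_gbps: float) -> float:
    """Data rate of an 8/10-encoded link: 8 data bits per 10 signaled bits."""
    return signal_rate_gbps * 8 / 10


# Illustrative links; lane counts are assumptions, not taken from the thread.
links = [
    ("InfiniBand 4x SDR", 10.0),
    ("InfiniBand 4x DDR", 20.0),
    ("PCIe x8 (2.5 Gb/s per lane)", 8 * 2.5),
]

for name, signal in links:
    print(f"{name}: {signal:g} Gb/s signaling, {data_rate_gbps(signal):g} Gb/s data")
```

Both sets of numbers describe the same links: 10 and 20 Gb/s are the signaling rates, 8 and 16 Gb/s are what is left for data after the encoding.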

The PCIe specification is exactly the same. Same link speed and same
8/10 data encoding. When you say 13.7Gb/s you are confusing the
specification with the MTU (data size) that some of the chipsets support.

For chipsets that support MTU > 128B, your calculation is wrong and the
data throughput is higher.

What is also interesting to know is that when one uses InfiniBand
20Gb/s, he/she can fully utilize the PCIe x8 link, while in your case
the Myricom I/O interface is the bottleneck.

> 
> > your 2G, and clearly better price/performance. This is a good 
> > indication that applications require more bandwidth than 2G.
> 
> It depends on the applications and also who does the 
> benchmarking. The most common "marketing mistake" is to look 
> at GM numbers, not MX. In latency-bounded codes, Myrinet 2G 
> with MX does outperform IB Mellanox, even DDR. My own 
> measurements on real applications show that MX-2G sometimes 
> beats Mellanox IB DDR on medium messages, typically when the 
> registration cache is ineffective (malloc hooks unusable or limited
> IOMMU) or when the code tries to overlap.

There was no "marketing mistake" in the results that I have obtained and
seen from 3 unbiased parties. In all the application benchmarks, Myrinet
2G shows poor performance compared to 10 and 20Gb/s.
As for the registration cache comment, I would go back to the "famous"
RDMA paper and the proper responses from IBM and others. The answer
to this comment is fully described in those responses.

> 
> Similarly, on many applications I have checked, Qlogic IB SDR 
> has better performance than Mellanox IB DDR, despite having a 
> smaller pipe (and despite Mellanox claiming the contrary).

Are you selling Myricom HW or Qlogic HW?
In general, application performance depends on the interconnect
architecture and not only on pure latency or pure bandwidth. Qlogic
till recently (*) had the lowest latency number, but when it comes to
applications, the CPU overhead is too high. Check some papers on Cluster
to see the application results.

When it comes to biased testing, I have seen (too many times) people
take an InfiniBand 20Gb/s card, place it into a PCIe x4 interface,
compare it to a 10Gb/s card (other IB, Eth etc.) placed in a PCIe x8
interface, and then claim that 10Gb/s is better than 20Gb/s ....

> 
> There are a lot of external factors as well. An application 
> that is not bandwidth bounded can become one if the number of 
> cores increases for example. So different host configurations 
> yield different results.
> 
> Price/performance also depends on the price, and the price 
> depends on the market, the volume, the vendor relationship, 
> the competitive environment, etc. You seem to assume a high 
> price for Myrinet 2G, but that may not be a safe assumption.
> 

The only thing I assume is that price/performance is subjective
and serves as marketing propaganda from EVERY vendor.

> 
> In conclusion, I will repeat myself: I believe that bigger 
> pipes do not always have a better price/performance, nor even 
> simply better performance, it depends very much on the 
> application. The most used HPC interconnect in the world 
> today is still Gigabit Ethernet, and it has the best 
> price/performance ratio for a lot of codes.

I agree that application performance depends on the full
architecture of the interconnect and not just on single
points of performance.

There are applications that don't use the interconnect at
all, and in this case you can use mail....
In most cases, and when you have more and more cores in a system,
the need for faster and fatter I/O increases and GigE will not
be sufficient any more.

> 
> 
> Patrick
> 
> 
> (*) For the curious, the maximum efficiency of PCI Express x8 
> is 86%, best scenario. The Read DMA completions are 128 bytes 
> max on today's PCIE chipsets (default is 64 bytes), with a 20 
> bytes header (4 bytes for DLL, 12 bytes for TLP 3DW, 4 bytes 
> of LCRC). That's 20 bytes header for 128 bytes payload, i.e. 
> 128/148 = 0.86. Link data rate is 16 Gb/s, so 16*0.86 = 13.7 Gb/s 
> after protocol. With ECRC or on Intel chipsets, there are 4 more 
> bytes, so the max Read throughput becomes 13.5 Gb/s. The real 
> limit depends on the chipset and can be much lower than that.
> 
> --
> Patrick Geoffray
> Myricom, Inc.
> http://www.myri.com
> 
> 
>
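
For anyone who wants to replay the footnote arithmetic with other completion sizes, a minimal sketch follows. The overhead bytes and the 16 Gb/s link data rate are taken from the quoted footnote; the function names and the 256-byte payload are hypothetical, added only to illustrate the "MTU > 128B" case, and are not measurements of any chipset.

```python
# Per-packet overhead from the quoted footnote:
# 4 bytes DLL + 12 bytes TLP (3DW) + 4 bytes LCRC.
BASE_OVERHEAD_BYTES = 4 + 12 + 4
ECRC_BYTES = 4  # extra bytes with ECRC or on Intel chipsets, per the footnote
LINK_DATA_RATE_GBPS = 16.0  # PCIe x8 data rate after 8/10 encoding


def read_efficiency(payload_bytes: int, overhead_bytes: int = BASE_OVERHEAD_BYTES) -> float:
    """Fraction of link data rate left for payload in each Read completion."""
    return payload_bytes / (payload_bytes + overhead_bytes)


def max_read_throughput_gbps(payload_bytes: int, overhead_bytes: int = BASE_OVERHEAD_BYTES) -> float:
    """Best-case Read throughput for a given completion payload size."""
    return LINK_DATA_RATE_GBPS * read_efficiency(payload_bytes, overhead_bytes)


# 64 and 128 bytes come from the footnote; 256 bytes is a hypothetical larger size.
for payload in (64, 128, 256):
    plain = max_read_throughput_gbps(payload)
    with_ecrc = max_read_throughput_gbps(payload, BASE_OVERHEAD_BYTES + ECRC_BYTES)
    print(f"{payload:3d} B payload: {plain:.2f} Gb/s, {with_ecrc:.2f} Gb/s with 4 extra bytes")
```

Note that the footnote rounds the efficiency to 0.86 before multiplying, so the unrounded 128-byte figure printed here comes out slightly higher than 13.7 Gb/s. The sketch only covers the best case; as the footnote says, the real limit depends on the chipset.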