[Beowulf] Performance characterising a HPC application

Mon Mar 26 10:04:13 PDT 2007

> Offload, usually implemented by RDMA offload, or the ability 
> for a NIC to autonomously send and/or receive data from/to 
> memory is certainly a nice feature to tout.  If one considers 
> RDMA at an interface level (without looking at the 
> registration calls required on some interconnects), it's the 
> purest and most flexible form of interconnect data transfer.  
> Unfortunately, this pure form of data transfer has a few caveats...

When Mellanox refers to transport offload, it means full transport
offload - for all transport semantics. InfiniBand, as you probably 
know, provides RDMA AND Send/Receive semantics, and in both cases 
you can do Zero-copy operations. 

This full flexibility gives the programmer the ability to choose the
best semantics for their use. Some programmers choose Send/Receive and
some choose RDMA. It all depends on their application.
From your response, I see that QLogic does not provide this kind
of flexibility.

>   
> How the programming model can match up with the semantics of 
> RDMA is the real question.  A quick sampling suggests that 
> global-address space languages fit squarely on top of RDMA, 
> whereas MPI-2 almost does if less of its windowing complexity 
> is considered.  MPI-1, the most popular model out there, has 
> the least in common with RDMA offload.  Under its simplest 
> form in MPI implementations, RDMA can be used for half of the 
> communication protocol involved in large messages.  In its 
> complex form, it can be used to handle small to medium-sized 
> messages as shown by a few openib/iwarp MPI implementations 
> (although these implementations really implement a complex 
> assortment of hybrid RDMA and non-RDMA mechanisms to provide 
> scalable performance).
> 
> RDMA offload, depending on the complexity of its 
> implementation, can buy you little to lots of communication 
> offload (or "total" communication offload in Quadrics' case).  
> But RDMA implementations aside, you can only offload what the 
> programming model *and* the programmer will let you.  
> Programmers must understand data dependencies in their codes 
> and know where and how to separate communication initiation 
> and completion points.  Even well intentioned programmers can 
> fail to expose their apps for communication offload -- 
> complex legacy apps can be intimidating to modify, some apps 
> may have strong data dependencies and others may be dominated 
> by collectives which are themselves indivisible (i.e.
> blocking).  And finally, a programmer who can successfully 
> overcome all these hurdles cannot expect to be provided 
> with an equal level of overlap on all interconnects.
> 
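
As a rough illustration of the overlap argument quoted above, here is a
back-of-the-envelope model, not a measurement of any interconnect. The
function name, its parameters and the millisecond figures are all
assumptions made up for this sketch. It treats a step as some computation
plus some communication, and lets only the fraction of communication that
the code actually exposes hide behind the computation.

```python
def step_time(compute_ms, comm_ms, overlap_fraction):
    """Estimate the wall-clock time of one compute+communicate step.

    overlap_fraction is the share of communication time that the
    application allows to proceed while it computes: 0.0 models blocking
    Send/Recv, 1.0 models communication that is initiated early and
    completed late. All names and numbers are illustrative assumptions.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    # Communication can only hide behind computation that actually exists.
    hidden_ms = min(compute_ms, comm_ms * overlap_fraction)
    return compute_ms + comm_ms - hidden_ms


if __name__ == "__main__":
    # Hypothetical step: 8 ms of computation, 4 ms of communication.
    for fraction in (0.0, 0.25, 1.0):
        print(fraction, step_time(8.0, 4.0, fraction))
```

In this toy model the step only drops from 12 ms to 8 ms when all of the
communication is exposed for overlap. If data dependencies leave a quarter
of it exposed, the step still takes 11 ms, whatever the offload engine
could do in principle.
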
> There's a good reason that many programmers continue to find 
> refuge in simple offload-less primitives like Send/Recv: the 
> expectation that it's in the interest of every MPI and 
> interconnect vendor to provide the best Send/Recv possible.
>   
> Many competent programmers will reap definite benefits from 
> highly specialized implementations of RDMA offload.  But then 
> again, these programmers will also know how to analyse their 
> applications and may come to completely different 
> conclusions.  For example, they may come to realise that most 
> of their codes cannot fully benefit from offload and that the 
> interconnect that spends the least time in specific MPI 
> primitives is the best choice -- hardware-assisted 
> operations, pt-to-pt midsize message performance or 
> consistent cluster-wide message latency, etc.  Understanding 
> the expected performance of specific communication primitives 
> is an application-centric view of performance evaluation.  
> Assuming that more cores necessarily require fatter pipes, 
> pt-to-pt latency measurements, signaling rates, messaging 
> rates, etc. are all microbenchmark-centric views of 
> interconnect evaluation.  Picking on the latter is just too 
> simplistic and rarely translates into a general and 
> verifiable view of the world, but it's good fodder for 
> oneupmanship and insipid (but entertaining) inter-vendor 
> bickering.
> 
> RDMA offload is attractive for many other reasons, but in the 
> context of today's most popular programming model it isn't as 
> vital as one would like. It's reasonable conventional wisdom 
> that offload is a desirable feature, but the way programming 
> models have been moving (i.e. not moving), interconnects that 
> do not offer elaborate communication offload mechanisms are 
> not at a loss, far from it.
> Efficiently exploiting a low-level RDMA engine for the 
> purposes of message passing would mean enabling its pure data 
> transfer capability to percolate through the many levels of 
> software stack and programming model semantics mostly 
> unscathed.  This is an unrealistic expectation.
> 
> I've yet to see a significant number of message-passing 
> applications show that an RDMA offload engine, as opposed to 
> any other messaging engine, is a stronger performance 
> determinant.  That's probably because there are other equally 
> important and desirable features implemented in other 
> messaging engines.
> 
> 
> cheers,
> 
> 
>     . . christian
> 
> --
> christian.bell at qlogic.com
> (QLogic SIG, formerly Pathscale)
>