[Beowulf] 1.2 us IB latency?

Sat Apr 21 04:16:44 PDT 2007

At 21:00 20.04.2007, "Steffen Persvold" <steffen.persvold at scali.com> wrote:
>So I'm guessing, both Myrinet MX and Qlogic Infinipath (confirmed) are
>using PIO for "small" messages. Are we sure that Mellanox ConnectX
>doesn't? It seems they would have to in order to get the 1.2us numbers.
>There's nothing that stops them from doing:
>
>verbs_post_rdma_write() {
>...
>     if (msg_size < MAX_PIO_THRESHOLD) {
>         copybuffertoremotewithpio();
>     } else {
>         setupdmaengine();
>     }
>...
>}

PIO is a term with two different
interpretations. For a shared address space NIC,
such as Dolphin's SCI adapters, PIO means that the
sending CPU writes data directly into the user
space of a remote process on a remote node. The
cluster interconnect emulates a PCI-to-PCI bridge
in this case. On other NICs, PIO means using
the processor to transmit the DMA descriptor and
the data to the local NIC. The local NIC then
issues a DMA to transmit the data/message to the
remote node from a local buffer on the NIC. The
main point is that the local NIC doesn't have to
issue a DMA read to local memory in order to read
the DMA descriptor and data.
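
To make the second interpretation concrete, here is a toy
model in C. It is only a sketch: struct toy_nic,
struct send_desc, pio_send(), dma_send() and both size
limits are made-up names and values, not any vendor's
interface. It simply counts CPU writes to the NIC and
NIC-initiated reads of host memory for the two send paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NIC_BUF_SIZE    256  /* hypothetical on-NIC send buffer size */
#define PIO_MAX_PAYLOAD  64  /* hypothetical small-message limit */

/* Hypothetical send descriptor; real NICs define their own layout. */
struct send_desc {
    uint32_t    dest_node;
    uint32_t    length;
    const void *host_buf;    /* used only on the DMA path */
};

/* Toy NIC: a local send buffer plus counters for bus traffic. */
struct toy_nic {
    struct send_desc desc;
    uint8_t  buf[NIC_BUF_SIZE];
    unsigned cpu_writes;     /* CPU-initiated writes to the NIC */
    unsigned dma_reads;      /* NIC-initiated reads of host memory */
};

/* PIO send: the CPU pushes descriptor and data to the NIC, so the
 * NIC needs no DMA read of host memory before it can transmit. */
static int pio_send(struct toy_nic *nic, uint32_t dest,
                    const void *data, size_t len)
{
    if (len > PIO_MAX_PAYLOAD)
        return -1;           /* too large for the PIO path */
    nic->desc.dest_node = dest;
    nic->desc.length = (uint32_t)len;
    nic->desc.host_buf = NULL;
    memcpy(nic->buf, data, len);
    nic->cpu_writes += 1;    /* modeled as one combined write */
    return 0;
}

/* DMA send: the CPU only rings a doorbell; the NIC then reads the
 * descriptor and the data from host memory. */
static int dma_send(struct toy_nic *nic, const struct send_desc *host_desc)
{
    assert(host_desc->host_buf != NULL);
    if (host_desc->length > NIC_BUF_SIZE)
        return -1;           /* toy buffer limit only */
    nic->cpu_writes += 1;    /* doorbell write */
    nic->desc = *host_desc;  /* NIC fetches the descriptor */
    nic->dma_reads += 1;
    memcpy(nic->buf, host_desc->host_buf, host_desc->length);
    nic->dma_reads += 1;     /* NIC fetches the payload */
    return 0;
}
```

In this model a small message sent with pio_send() costs
one CPU write and no DMA reads, while dma_send() costs one
CPU write plus two DMA reads before anything can go out on
the wire.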

So, when Mellanox reduces the latency from around
4 to around 1 usec, I assume they have modified
the hardware-software interface of their HCA to
enable PIO mode send operations, where DMA
descriptor+data is transmitted on the PCI(e) bus
using a single WC (write-combining) bus tenure.
I haven't used a PCI analyzer on their HCAs, but
a rule of thumb is that every I/O operation to a
NIC takes on the order of 1 usec. So maybe they
have managed to go from three I/O operations to
one in order to kick off a transfer. Pure
speculation from my side though.
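
The rule-of-thumb arithmetic can be written down as a
back-of-the-envelope sketch. IO_OP_USEC is the ~1 usec
rule of thumb from this post, not a measurement, and the
function names are invented for illustration. The estimate
covers only the cost of kicking off a transfer, not wire,
switch or receive-side time.

```c
#include <assert.h>

/* Rule-of-thumb cost of one I/O operation to a NIC, in usec. */
#define IO_OP_USEC 1.0

/* Estimated cost of kicking off a transfer that needs n_ops
 * I/O operations to the NIC. */
static double kickoff_usec(unsigned n_ops)
{
    return n_ops * IO_OP_USEC;
}

/* Estimated saving when the number of I/O operations drops. */
static double kickoff_saving_usec(unsigned ops_before, unsigned ops_after)
{
    assert(ops_before >= ops_after);
    return kickoff_usec(ops_before) - kickoff_usec(ops_after);
}
```

Under these assumptions, going from three I/O operations
to one saves about 2 usec, which is the same order as the
latency drop being discussed.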

Håkon

--
Håkon Bugge
CTO
dir. +47 22 62 89 72
mob. +47 92 48 45 14
fax. +47 22 62 89 51
Hakon.Bugge at scali.com
Skype: hakon_bugge

Scali - http://www.scali.com
Scaling the Linux Datacenter