[Beowulf] 1.2 us IB latency?
Håkon Bugge
Hakon.Bugge at scali.com
Sat Apr 21 04:16:44 PDT 2007
At 21:00 20.04.2007, "Steffen Persvold" <steffen.persvold at scali.com> wrote:
>So I'm guessing both Myrinet MX and Qlogic Infinipath (confirmed) are
>using PIO for "small" messages. Are we sure that Mellanox ConnectX
>doesn't? It seems they would have to in order to get the 1.2us numbers.
>There's nothing that stops them from doing:
>
>verbs_post_rdma_write() {
>    ...
>    if (msg_size < MAX_PIO_THRESHOLD) {
>        copybuffertoremotewithpio();
>    } else {
>        setupdmaengine();
>    }
>    ...
>}
PIO is a term with two different interpretations. For a
shared-address-space NIC, such as Dolphin's SCI adapters, PIO
means that the sending CPU writes data directly into the user
space of a remote process on a remote node. The cluster
interconnect emulates a PCI-to-PCI bridge in this case. On
other NICs, PIO means using the processor to write the DMA
descriptor and the data to the local NIC. The local NIC then
issues a DMA to transmit the data/message to the remote node
from a local buffer on the NIC. The main point is that the
local NIC doesn't have to issue a DMA read to local memory in
order to fetch the DMA descriptor and data.
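To make the second interpretation concrete, here is a rough sketch in C of the two send paths on a toy, purely simulated NIC. Everything in it is an assumption made for illustration: the names (toy_nic, send_desc, pio_send, dma_send, post_send), the 64-byte cutoff, and the split of the DMA path into doorbell + descriptor fetch + data fetch. It is not the interface of any real HCA; it only counts bus operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NIC_BUF_SIZE      256  /* hypothetical on-NIC send buffer */
#define MAX_PIO_THRESHOLD 64   /* hypothetical PIO/DMA cutoff in bytes */

/* Hypothetical send descriptor; real hardware layouts differ. */
struct send_desc {
    uint64_t remote_addr;
    uint32_t length;
    uint32_t flags;
};

/* Toy NIC model that only counts bus operations. */
struct toy_nic {
    uint8_t  buf[NIC_BUF_SIZE];
    unsigned cpu_writes;  /* CPU -> NIC writes (PIO burst or doorbell) */
    unsigned dma_reads;   /* NIC -> host memory reads */
};

/* PIO send: the CPU pushes descriptor + payload to the NIC itself. */
static int pio_send(struct toy_nic *nic, const struct send_desc *d,
                    const void *data)
{
    assert(nic && d && data);
    if (sizeof *d + d->length > NIC_BUF_SIZE)
        return -1;
    memcpy(nic->buf, d, sizeof *d);
    memcpy(nic->buf + sizeof *d, data, d->length);
    nic->cpu_writes += 1;  /* modelled as one write-combined burst */
    return 0;
}

/* DMA send: the CPU rings a doorbell, then the NIC reads the
 * descriptor and the payload from host memory on its own. */
static int dma_send(struct toy_nic *nic, const struct send_desc *d,
                    const void *data)
{
    struct send_desc fetched;

    assert(nic && d && data);
    if (sizeof *d + d->length > NIC_BUF_SIZE)
        return -1;
    nic->cpu_writes += 1;                 /* doorbell write */
    memcpy(&fetched, d, sizeof fetched);  /* NIC fetches the descriptor */
    nic->dma_reads += 1;
    memcpy(nic->buf, &fetched, sizeof fetched);
    memcpy(nic->buf + sizeof fetched, data, fetched.length);
    nic->dma_reads += 1;                  /* NIC fetches the payload */
    return 0;
}

/* Dispatch on message size, in the spirit of the quoted pseudocode. */
static int post_send(struct toy_nic *nic, const struct send_desc *d,
                     const void *data)
{
    return d->length < MAX_PIO_THRESHOLD ? pio_send(nic, d, data)
                                         : dma_send(nic, d, data);
}

/* Total bus operations needed so far to kick off transfers. */
static unsigned bus_ops(const struct toy_nic *nic)
{
    return nic->cpu_writes + nic->dma_reads;
}
```

In this model a small message costs one bus operation and a large one costs three. The counts follow directly from the assumptions above, not from any measurement.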
So, when Mellanox reduces the latency from around 4 to around
1 usec, I assume they have modified the hardware-software
interface of their HCA to enable PIO-mode send operations,
where DMA descriptor+data is transmitted on the PCI(e) bus
using a single WC (write-combining) bus tenure. I haven't used
a PCI analyzer on their HCAs, but a rule of thumb is that every
I/O operation to a NIC takes on the order of 1 usec. So maybe
they have managed to go from three I/O operations to one in
order to kick off a transfer. Pure speculation from my side,
though.
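The back-of-the-envelope arithmetic behind that guess can be written down as a tiny C sketch. The 1 usec per I/O operation is only the rule of thumb stated above, not a measured value, and the function names are made up for illustration.

```c
#include <assert.h>

/* Rule of thumb from the text, not a measurement. */
#define USEC_PER_IO_OP 1.0

/* Estimated cost of kicking off a transfer with io_ops I/O operations. */
static double kickoff_cost_usec(unsigned io_ops)
{
    return (double)io_ops * USEC_PER_IO_OP;
}

/* Estimated saving when the number of I/O operations is reduced. */
static double kickoff_saving_usec(unsigned ops_before, unsigned ops_after)
{
    assert(ops_before >= ops_after);
    return kickoff_cost_usec(ops_before) - kickoff_cost_usec(ops_after);
}
```

Under this rule of thumb, kickoff_saving_usec(3, 1) comes out at 2 usec, which is in the same ballpark as the drop from around 4 to around 1 usec. Any remaining difference would have to come from somewhere else.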
Håkon
--
Håkon Bugge
CTO
dir. +47 22 62 89 72
mob. +47 92 48 45 14
fax. +47 22 62 89 51
Hakon.Bugge at scali.com
Skype: hakon_bugge
Scali - http://www.scali.com
Scaling the Linux Datacenter