[Beowulf] 1.2 us IB latency?

Kevin Ball kball at pathscale.com
Tue Apr 24 15:06:56 PDT 2007


On Tue, 2007-04-24 at 08:55, Ashley Pittman wrote:
> On Sat, 2007-04-21 at 13:16 +0200, Håkon Bugge wrote:
> > PIO is a term with two different 
> > interpretations. For a shared address space NIC, 
> > such as Dolphin's SCI adapters, PIO implies that the 
> > sender CPU writes data directly into the user 
> > space of a remote process on a remote node. The 
> > cluster interconnect emulates a PCI to PCI bridge 
> > in this case. On other NICs, PIO implies using 
> > the processor to transmit the DMA description and 
> > the data to the local NIC. Then the local NIC 
> > issues a DMA to transmit the data/message to the 
> > remote node from a local buffer on the NIC. The 
> > main point is the local NIC doesn't have to issue 
> > a DMA read to local memory in order to read the DMA descriptor and data.
> 
> That would explain why QLogic uses PIO for up to 64k messages and we
> switch to DMA at only a few hundred.  For small messages you could best
> describe what we use as a hybrid of the above descriptions: we write
> a network packet across the PCI bus and don't DMA at all.
> 
> The downside to PIO of course is you need a CPU to drive it, so besides
> the fact that it's slow you can't do anything asynchronously.
> 
> > So, when Mellanox reduces the latency from around 
> > 4 to around 1 usec, I assume they have modified 
> > the hardware-software interface of their HCA to 
> > enable PIO mode send operations, where DMA 
> > descriptor+data is transmitted on the PCI(e) bus 
> > using a single WC bus tenure.  I haven't used a 
> > PCI analyzer on their HCAs, but a rule of thumb 
> > is that every I/O operation to a NIC takes on the 
> > order of 1 usec. So maybe they have managed to go 
> > from three I/O operations to one in order to kick off 
> > a transfer. Pure speculation from my side though.
> 
> That's an interesting theory, but I suspect your numbers are a little
> out.  My own measurements put a PIO word write in the region of 0.15 usec
> depending on chipset.  Of course if you are right then the remaining PIO
> write is happening in 1 usec, which leaves only 0.2 usec for the network,
> which seems a little fast to me.
> 
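[Editorial aside: the arithmetic in the paragraph above can be checked in a
few lines. This is only a sketch using the figures quoted in this thread
(1.2 usec headline latency, roughly 1 usec per I/O operation as a rule of
thumb, and roughly 0.15 usec for a measured PIO word write). The function and
variable names are illustrative, and a full descriptor-plus-data write may
well cost more than a single word write.]

```python
# Figures quoted in this thread, all in microseconds.
TOTAL_LATENCY_US = 1.2         # headline end-to-end latency
IO_OP_RULE_OF_THUMB_US = 1.0   # rule of thumb: ~1 usec per I/O operation
PIO_WORD_WRITE_US = 0.15       # measured PIO word write, chipset dependent


def remaining_after_pio(total_us, pio_us, pio_ops=1):
    """Time left for everything other than the PIO writes (wire, NIC, receive)."""
    return total_us - pio_ops * pio_us


# If one I/O operation really costs ~1 usec, very little is left for the network.
slow_pio = remaining_after_pio(TOTAL_LATENCY_US, IO_OP_RULE_OF_THUMB_US)
# With the measured word-write cost, most of the budget is still available.
fast_pio = remaining_after_pio(TOTAL_LATENCY_US, PIO_WORD_WRITE_US)

print(f"1.00 usec per I/O operation: {slow_pio:.2f} usec left")
print(f"0.15 usec per PIO write:     {fast_pio:.2f} usec left")
```
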
> Regardless of how they have done it, 1.2 is impressive. I would be even
> more impressed if it were quoted as 1.20, which would, as far as
> I'm aware, mean that they had the lowest latency of anybody.

This is true if the 1.2 number is quoted through a switch, but as I
understand it Mellanox quotes back-to-back numbers as their latency
numbers.  I have measured QLogic HTX adapters within 50 ns of 1.0 usec when
going back to back, but no one I'm aware of actually uses IB that way; 
everyone wants to run in a cluster with more than 2 nodes using a
switch, so that's how we quote our latency.

Disclosure: in case it's not clear from the above, I do work at QLogic,
but anyone with our HT cards can reproduce the above for themselves.
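
[Editorial aside: latency figures like these are conventionally reported as
half the round-trip time of a small-message ping-pong. The sketch below shows
only that methodology, using a local socket pair from the Python standard
library. It is not the benchmark used for any number in this thread, the
function names are illustrative, and the result says nothing about IB
hardware. On a real cluster the same loop would run as an MPI ping-pong
between two nodes, either back to back or through a switch.]

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo(sock, iterations, size):
    """The 'pong' side: send every received message straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, size))


def pingpong_latency_us(iterations=1000, size=8):
    """Return the mean one-way latency (half round trip) in microseconds."""
    ping, pong = socket.socketpair()
    worker = threading.Thread(target=echo, args=(pong, iterations, size))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(payload)
        recv_exact(ping, size)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    # One iteration is a full round trip, so halve it for the one-way figure.
    return elapsed / iterations / 2 * 1e6


if __name__ == "__main__":
    print(f"half round trip: {pingpong_latency_us():.2f} usec")
```
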

-Kevin

> 
> Ashley,
> 
> 