[Beowulf] 1.2 us IB latency?


Kevin Ball kball at pathscale.com
Tue Apr 24 15:06:56 PDT 2007


On Tue, 2007-04-24 at 08:55, Ashley Pittman wrote:
> On Sat, 2007-04-21 at 13:16 +0200, Håkon Bugge wrote:
> > PIO is a term with two different 
> > interpretations. For a shared address space NIC, 
> > such as Dolphin's SCI adapters, PIO implies a 
> > sender CPU to write data directly into the user 
> > space of a remote process on a remote node. The 
> > cluster interconnect emulates a PCI to PCI bridge 
> > in this case. On other NICs, PIO implies using 
> > the processor to transmit the DMA description and 
> > the data to the local NIC. Then the local NIC 
> > issues a DMA to transmit the data/message to the 
> > remote node from a local buffer on the NIC. The 
> > main point is the local NIC doesn't have to issue 
> > a DMA read to local memory in order to read the DMA descriptor and data.
> 
> That would explain why QLogic uses PIO for up to 64k messages and we
> switch to DMA at only a few hundred.  For small messages you could best
> describe what we use as a hybrid of the above descriptions: we write the
> network packet across the PCI bus and don't DMA at all.
> 
> The downside to PIO of course is you need a CPU to drive it so besides
> the fact it's slow you can't do anything asynchronously.
> 
> > So, when Mellanox reduces the latency from around 
> > 4 to around 1 usec, I assume they have modified 
> > the hardware-software interface of their HCA to 
> > enable PIO mode send operations, where DMA 
> > descriptor+data is transmitted on the PCI(e) bus 
> > using a single WC bus tenure.  I haven't used a 
> > PCI analyzer on their HCAs, but a rule of thumb 
> > is that every I/O operation to a NIC takes on the 
> > order of 1 usec. So maybe they have managed to go 
> > from 3 to one I/O operation in order to kick off 
> > a transfer. Pure speculation from my side though.
> 
> That's an interesting theory, but I suspect your numbers are a little
> out.  My own measurements put a PIO word write in the region of .15 uSec
> depending on chipset.  Of course if you are right then the remaining PIO
> write is happening in 1 uSec which leaves only .2uSec for the network
> which seems a little fast to me.
> 
> Regardless of how they have done it, 1.2 is impressive. What would
> impress me even more is if it were quoted as 1.20, which would, as far as
> I'm aware, mean that they had the lowest latency of anybody.

This is true if the 1.2 number is quoted through a switch, but as I
understand it Mellanox quotes back-to-back numbers as their latency
numbers.  I have measured QLogic HTX adapters within 50ns of 1.0 usec if
going back to back, but no one I'm aware of actually uses IB that way; 
everyone wants to run in a cluster with more than 2 nodes using a
switch, so that's how we quote our latency.
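
The latency arithmetic in the quoted text above can be checked with a
small sketch. It assumes the simple model implied by the thread: the
send-side cost is the number of I/O operations times a fixed cost per
operation, and whatever remains of the quoted latency is left for the
network. The function names are hypothetical, and the only inputs are
figures quoted in this thread (1.2 usec total, roughly 1 usec per I/O
operation as the rule of thumb, roughly 0.15 usec per PIO word write as
measured).

```python
def send_overhead_us(io_ops, per_op_us):
    """Send-side cost under the assumed model: I/O operations x cost per operation."""
    return io_ops * per_op_us


def network_budget_us(total_us, io_ops, per_op_us):
    """Time left for the wire and the remote side once send-side I/O is paid for."""
    return total_us - send_overhead_us(io_ops, per_op_us)


TOTAL_US = 1.2          # quoted end-to-end latency
RULE_OF_THUMB_US = 1.0  # rule of thumb from the thread: ~1 usec per I/O operation
MEASURED_PIO_US = 0.15  # measurement from the thread: ~0.15 usec per PIO word write

for label, per_op in (("rule of thumb", RULE_OF_THUMB_US),
                      ("measured PIO write", MEASURED_PIO_US)):
    # Saving from going from 3 I/O operations to 1 to kick off a transfer.
    saved = send_overhead_us(3, per_op) - send_overhead_us(1, per_op)
    # What is left of the 1.2 usec after the single remaining I/O operation.
    left = network_budget_us(TOTAL_US, 1, per_op)
    print(f"{label}: 3 -> 1 ops saves {saved:.2f} usec, "
          f"leaving {left:.2f} usec for the network")
```

Under the 1 usec rule of thumb, a single remaining I/O operation leaves
only about 0.2 usec of the 1.2 usec for the network; with the measured
0.15 usec figure it leaves about 1.05 usec.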

Disclosure: in case it's not clear from the above, I do work at QLogic,
but anyone with our HT cards can reproduce the above for themselves.

-Kevin

> 
> Ashley,
> 
> 