[Beowulf] 1.2 us IB latency?
Håkon Bugge
Hakon.Bugge at scali.com
Wed Apr 25 02:31:05 PDT 2007
At 17:55 24.04.2007, Ashley Pittman wrote:
>That would explain why qlogic use PIO for up to 64k messages and we
>switch to DMA at only a few hundred. For small messages you could best
>describe what we use as a hybrid of the above descriptions: we write a
>network packet across the PCI bus and don't DMA at all.
I assume QsNet has to do something with the
packet after it has been written to the HCA.
Since the outbound PCI address space is only
32 bits (who needs more than 4 GB of CSR space, other
than cluster people attempting to map all the
accumulated memory of the nodes in the cluster
into a single address space?), I assume QsNet
uses part of the packet as 64-bit address
information and starts a DMA from the HCA local
buffer to the remote destination.
>The downside to PIO of course is you need a CPU to drive it so besides
>the fact it's slow you can't do anything asynchronously.
This is a classic tradeoff. Most applications
_create_ the message before it is sent (unlike
many point-to-point benchmarks). Hence, it resides in the
L1 or L2 cache of the CPU in the (MOESI) Modified
state. It is then very efficient to use the CPU to
read its local cache and write the message using
the WC buffer. In contrast, the HCA has to issue a
DMA read to memory, the CPU cache(s) are snooped, and
the data is transferred to memory _and_ to the
HCA. The cache line ends up in the Shared state, and
a bus transaction is required in order to make it
Modified again (when the buffer is written the next time).
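As a rough illustration of the cache-state argument above, here is a toy Python model of a single cache line holding the message buffer. It is only a sketch: the class and method names are made up for this example, it tracks just the Modified and Shared states, and it counts one "coherence transaction" per snoop write-back and per upgrade back to Modified, which simplifies what real chipsets do.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    SHARED = "S"


class CacheLine:
    """Toy model of one cache line holding the message buffer (hypothetical)."""

    def __init__(self):
        self.state = State.SHARED
        self.coherence_transactions = 0

    def cpu_write(self):
        # Writing a line that is not Modified needs a bus transaction
        # to regain ownership before the store can complete.
        if self.state is not State.MODIFIED:
            self.coherence_transactions += 1
            self.state = State.MODIFIED

    def pio_send(self):
        # PIO: the CPU reads its own cache and stores to the device.
        # The line stays Modified, so nothing is counted here.
        pass

    def dma_send(self):
        # DMA: the device's read snoops the cache; the dirty data goes
        # to memory and to the device, and the line drops to Shared.
        if self.state is State.MODIFIED:
            self.coherence_transactions += 1
            self.state = State.SHARED


def run(send_method, iterations):
    """Create-then-send a message `iterations` times; return the transaction count."""
    line = CacheLine()
    for _ in range(iterations):
        line.cpu_write()              # the application creates the message
        getattr(line, send_method)()  # then sends it via PIO or DMA
    return line.coherence_transactions


print("PIO:", run("pio_send", 10))
print("DMA:", run("dma_send", 10))
```

In this model the PIO path pays once to get the line into Modified and never again, while the DMA path pays twice per message. The model deliberately ignores the cost of the PIO stores themselves, which is the other side of the tradeoff.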
>That's an interesting theory, but I suspect your numbers are a little
>out. My own measurements put a PIO word write in the region of .15 uSec
>depending on chipset. Of course if you are right then the remaining PIO
>write is happening in 1 uSec which leaves only .2uSec for the network
>which seems a little fast to me.
Just to make sure we are comparing the same thing: is the
.15 usec the time from the CPU issuing the
store instruction until the side effect is
visible in the HCA? In other words, assuming a CSR
word read takes 0.5 usec, a loop writing and
reading the same CSR takes 0.65 usec, right? If
that is the case, CSR accesses have improved radically in the last few years.
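The arithmetic behind that question can be written out as a small sketch. The function names are hypothetical, the 0.5 usec read cost is the assumption made above rather than a measurement, and the model assumes the read cannot complete until the preceding write has reached the device, so the two costs simply add. Values are in nanoseconds to keep the arithmetic exact.

```python
def loop_time_ns(write_ns, read_ns):
    """Time for one write-then-read iteration on the same CSR, in ns."""
    # Assumption: the read is ordered behind the write, so the costs add.
    return write_ns + read_ns


def implied_write_ns(measured_loop_ns, read_ns):
    """Write cost implied by a measured write+read loop and a known read cost."""
    return measured_loop_ns - read_ns


print(loop_time_ns(150, 500))      # 650
print(implied_write_ns(650, 500))  # 150
```

If a measured write+read loop comes out well above 650 ns with a 500 ns read, the .15 usec figure would more likely be the cost of issuing the store than the time until it is visible in the HCA.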
Håkon