[Beowulf] Nehalem / PCIe I/O write latency?

Patrick Geoffray patrick at myri.com
Thu Oct 22 10:17:35 PDT 2009

Hey Larry,

Larry Stewart wrote:
> Does anyone know, or know where to find out, how long it takes to do a
> store to a device register on a Nehalem system with a PCI Express device?

Are you asking about latency or throughput? For latency, it depends on
the distance between the core and the IOH (each QuickPath hop can be
~100 ns, if I remember correctly) and on whether there are PCIe switches
before the device. For throughput, you are limited by the PCIe bandwidth
(~75% efficiency of link rate), but you can reach it with 64-byte writes.
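
To put a number on the throughput side, here is a rough sketch of the
arithmetic. The link parameters are assumptions for illustration, not
something measured: a PCIe Gen2 x8 link carries about 500 MB/s per lane
per direction after 8b/10b encoding, and I take "link rate" to mean that
post-encoding figure. The function name is made up.

```c
#include <assert.h>

/* Rough estimate of achievable PCIe write throughput in MB/s.
 * lanes and per_lane_mb_s describe the link (assumed values);
 * efficiency is the fraction of link rate left after packet
 * overhead, ~0.75 as quoted above. */
static double pcie_write_throughput_mb_s(int lanes, double per_lane_mb_s,
                                         double efficiency)
{
    assert(lanes > 0 && efficiency > 0.0 && efficiency <= 1.0);
    return lanes * per_lane_mb_s * efficiency;
}
```

With those assumptions, pcie_write_throughput_mb_s(8, 500.0, 0.75) gives
3000 MB/s, i.e. roughly 3 GB/s of 64-byte writes on a Gen2 x8 link.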

>  Also, does write combining work with such a setup?

Sure, write-combining works on all Intel CPUs since the Pentium III. It
only bursts at 64 bytes though; anything else is fragmented into 8-byte
writes. AMD chips do flush WC at 16, 32 and 64 bytes.

And don't assume that because you have WC enabled you will only see
64-byte writes. Sometimes, especially when there is an interrupt, the WC
buffer can be flushed early. And don't assume any order between the
resulting multiple 8-byte writes either.
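
A minimal sketch of what the host side usually looks like, assuming dst
points to a 64-byte-aligned region that the driver has mapped as
write-combining (how you get that mapping is OS- and driver-specific and
not shown). The names wc_write_line and wc_flush are made up for
illustration. On x86-64 the store fence drains the WC buffer; elsewhere
the sketch falls back to a generic compiler barrier builtin.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <xmmintrin.h>                    /* _mm_sfence() */
#define wc_flush() _mm_sfence()
#else
#define wc_flush() __sync_synchronize()   /* generic full barrier fallback */
#endif

enum { WC_WORDS = 8 };                    /* 8 x 8 bytes = one 64-byte WC buffer */

/* Fill one full 64-byte line with 8-byte stores, then fence so the
 * WC buffer is pushed out instead of lingering.
 * Assumption: dst is 64-byte aligned and WC-mapped. */
static void wc_write_line(volatile uint64_t *dst, const uint64_t src[WC_WORDS])
{
    for (int i = 0; i < WC_WORDS; i++)
        dst[i] = src[i];
    wc_flush();
}
```

Filling the whole line before the fence gives the CPU the chance to emit
a single 64-byte write, but as said above that is not guaranteed, so the
device side should still cope with eight 8-byte writes arriving in any
order.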

> I recall that the QLogic Infinipath uses such features to get good short 
> message performance, but my memory of it is pre-Nehalem.

Nehalem just adds NUMA overhead, and a lot more memory bandwidth.

> Question 2 - if the device stores to an address on which a core is 
> spinning, how long does it take for the core to return the new value?

On NUMA, it depends on whether the device writes to memory on the socket
where you are spinning. If it's the same socket, the cache is immediately
invalidated. If you busy-poll on a different socket, cache coherency gets
involved and it's more expensive. On the local socket, I would say
between 150 and 250 ns, assuming no PCIe switch in between (those can
cost 150 ns more).
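
For reference, the kind of spin loop I have in mind looks roughly like
the sketch below. It assumes flag lives in host memory that the device
DMA-writes to (ideally allocated on the socket where the polling core
runs); the function name and the max_spins bound are made up for
illustration.

```c
#include <stdint.h>

#if defined(__x86_64__)
#include <emmintrin.h>                 /* _mm_pause() */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)          /* no spin hint on other targets */
#endif

/* Spin until *flag differs from old_value or max_spins reads have
 * been done. Returns 1 and stores the new value in *new_value if a
 * change was seen, 0 on timeout. */
static int spin_until_changed(const volatile uint64_t *flag,
                              uint64_t old_value, uint64_t max_spins,
                              uint64_t *new_value)
{
    for (uint64_t i = 0; i < max_spins; i++) {
        uint64_t v = *flag;            /* volatile: re-read memory each time */
        if (v != old_value) {
            *new_value = v;
            return 1;
        }
        cpu_relax();
    }
    return 0;
}
```

While nothing changes, the reads hit in the local cache; the cost I
quoted is the time from the device write to the first read that misses
and returns the new value.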

