[Beowulf] Nehalem / PCIe I/O write latency?

Vincent Diepeveen diep at xs4all.nl
Fri Oct 23 06:38:28 PDT 2009

On Oct 22, 2009, at 7:17 PM, Patrick Geoffray wrote:

> Hey Larry,
> Larry Stewart wrote:
>> Does anyone know, or know where to find out, how long it takes to  
>> do a store to a device register on a Nehalem system with a
>> PCIexpress device?
> Are you asking for latency or throughput? For latency, it depends
> on the distance between the core and the IOH (each QuickPath hop
> can be ~100 ns if I remember well) and on whether there are PCIe
> switches before the device. For throughput, it is limited by the
> PCIe bandwidth (~75% efficiency of link rate) but you can reach it
> with 64-byte writes.
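
As a back-of-the-envelope illustration of the figures Patrick quotes,
here is a minimal sketch in C. The constants are just the rough
numbers from his mail (~100 ns per QuickPath hop, ~75% of link rate);
the function names are made up for this example, and PCIe switches
would add more latency on top.

```c
#include <assert.h>

/* Rough figures quoted in this thread; treat them as assumptions. */
#define QPI_HOP_NS      100.0  /* per QuickPath hop, "if I remember well" */
#define PCIE_EFFICIENCY 0.75   /* usable fraction of the PCIe link rate */

/* Latency added by the QuickPath hops between the core and the IOH. */
double qpi_latency_ns(int hops)
{
    assert(hops >= 0);
    return hops * QPI_HOP_NS;
}

/* Usable write bandwidth, in the same units as link_rate. */
double pcie_usable_rate(double link_rate)
{
    assert(link_rate >= 0.0);
    return link_rate * PCIE_EFFICIENCY;
}
```

With these assumed numbers, two hops cost about 200 ns, and a
hypothetical link rate of 4.0 (in whatever unit you like) leaves about
3.0 usable.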

In practice the latency picture is a bit different.
Larry will be working with existing hardware components.

The hardware between the processor and the device may matter even more
for latency than the PCIe link itself.

For example, to most Nvidia GPGPU cards you'll see roughly a few dozen
to a few hundred microseconds of latency from the processor to the
device over PCIe.

A custom FPGA design (Xilinx based) that someone had already built for
PCI-X at 133 MHz had a limit of 100k per second, which is about 10 us
each. In such a case you really stress the hardware a lot, so the
latency itself 'eats', so to speak, a considerable amount (if not the
majority) of system time during such heavy stress tests.
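
The arithmetic behind the 100k per second and 10 us figures can be
sketched in a few lines of C. The function names are hypothetical, and
the second function assumes the processor simply blocks for the full
latency of every operation, which is a simplification.

```c
#include <assert.h>

/* Average time per operation, in microseconds, at a sustained rate. */
double per_op_us(double ops_per_second)
{
    assert(ops_per_second > 0.0);
    return 1e6 / ops_per_second;
}

/*
 * Fraction of each second spent waiting, assuming every operation
 * blocks the caller for latency_us microseconds.
 */
double busy_fraction(double ops_per_second, double latency_us)
{
    assert(ops_per_second >= 0.0 && latency_us >= 0.0);
    return ops_per_second * latency_us / 1e6;
}
```

per_op_us(100000.0) gives 10.0, i.e. the 10 us mentioned above. Under
the blocking assumption, busy_fraction(100000.0, 10.0) gives 1.0: the
waiting alone fills the whole second.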

The question is whether you want to do that.

For most devices that you can benchmark in this manner, a latency of
a few dozen to a few hundred microseconds has been a recurring figure
for many years already. Of course there is a considerable distance on
the mainboard between the devices and the processor.

Will it ever improve?

Co-processors do a lot better (obviously).


>>  Also, does write combining work with such a setup?
> Sure, write-combining works on all Intel CPUs since Pentium III. It
> only bursts at 64 bytes though, anything else is fragmented at 8
> bytes. AMD chips do flush WC at 16, 32 and 64 bytes.
> And don't assume that because you have WC enabled you will only
> have 64-byte writes. Sometimes, especially when there is an
> interrupt, the WC buffer can be flushed early. And don't assume
> order between the resulting multiple 8-byte writes either.
>> I recall that the QLogic Infinipath uses such features to get good  
>> short message performance, but my memory of it is pre-Nehalem.
> Nehalem just adds NUMA overhead, and a lot more memory bandwidth.
>> Question 2 - if the device stores to an address on which a core is  
>> spinning, how long does it take for the core to return the new value?
> On NUMA, it depends if you write on the socket where you are  
> spinning. If it's same socket, the cache is immediately  
> invalidated. If you busy-poll on a different socket, cache  
> coherency gets involved and it's more expensive. On the local  
> socket, I would say between 150 and 250ns, assuming no PCIe switch  
> in between (those can cost 150ns more).
> Patrick
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf
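
For Question 2 above (a core spinning on an address the device writes
to), the polling side might look roughly like the following sketch. It
uses C11 atomics on ordinary memory; the function name and the
max_spins bound are assumptions made for the example, and nothing here
measures the 150 to 250 ns Patrick estimates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Poll *flag until it differs from old_value, giving up after
 * max_spins polls. Returns the number of polls that still saw the
 * old value before the change was observed, or -1 if no change was
 * seen within max_spins polls.
 */
long spin_until_changed(_Atomic uint32_t *flag, uint32_t old_value,
                        long max_spins)
{
    assert(flag);
    for (long spins = 0; spins < max_spins; spins++) {
        /* Acquire load: data written before the flag is visible after it. */
        if (atomic_load_explicit(flag, memory_order_acquire) != old_value)
            return spins;
    }
    return -1;
}
```

To turn this into a latency measurement you would timestamp around the
call while a real device (or another thread standing in for it) writes
the flag; which socket the polling core and the flag's memory sit on
then matters, as described above.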