[Beowulf] 1.2 us IB latency?

Wed Apr 25 08:16:51 PDT 2007

On Wed, 2007-04-25 at 07:35 -0700, Christian Bell wrote:
> On Wed, 25 Apr 2007, Ashley Pittman wrote:
> 
> > You'd have thought that to be the case but PIO bandwidth is not a patch
> > on DMA bandwidth.  On alphas you used to get a performance improvement
> > by evicting the data from the cache immediately after you had submitted
> > the DMA but this doesn't buy you anything with modern machines.
> 
> Not a patch, but the main goal in many of our cases is minimizing the
> amount of time spent in MPI where conventional wisdom about offload
> doesn't unconditionally apply (and yes this is contrary to where I think
> programming models should be headed).

I'm not sure I follow; surely using PIO over DMA is a lose-lose
scenario?  As you say, conventional wisdom is that offload should win in
this situation...

Mind you we do something completely different for larger messages which
rules out the use of PIO entirely.

> I recently measured that it takes InfiniPath 0.165usec to do a complete
> MPI_Isend -- so in essence this is 0.165usec of software overhead
> that also includes the (albeit cheap Opteron) store fence.  I don't
> think that queueing a DMA request is much different in terms of
> software overhead.  For small messages, I suspect that most of the
> differences will be in the amount of time the request (PIO or DMA)
> remains queued in the NIC before it can be put on the wire.  If
> issuing a DMA request implies more work for the NIC compared to a PIO
> that requires no DMA reads, this will be apparent in the resulting
> message gap (and made worse as more sends are put in flight).
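
The queueing argument above can be sketched as a toy single-server FIFO
model.  This is only an illustration under stated assumptions: the
0.165usec host overhead is the figure quoted above, but the NIC service
times (0.1usec and 0.3usec) and the function name queue_delays are
hypothetical and are not measurements of any real HCA.

```python
def queue_delays(n_sends, host_overhead_us, nic_service_us):
    """Toy FIFO model of back-to-back sends (not a real NIC model).

    The host posts one request every host_overhead_us; the NIC needs
    nic_service_us to put each request on the wire.
    Returns (queued_us, gaps_us):
      queued_us[k] = time request k waits in the NIC before service starts
      gaps_us[k]   = time between request k and k+1 going on the wire
    """
    queued = []
    done = []
    nic_free_at = 0.0
    for k in range(n_sends):
        # Request k is posted once its software overhead has been paid.
        posted_at = (k + 1) * host_overhead_us
        start = max(posted_at, nic_free_at)
        queued.append(start - posted_at)
        nic_free_at = start + nic_service_us
        done.append(nic_free_at)
    gaps = [later - earlier for earlier, later in zip(done, done[1:])]
    return queued, gaps


if __name__ == "__main__":
    # Hypothetical service times; only 0.165 comes from the thread.
    for label, service in (("cheap request", 0.1), ("costly request", 0.3)):
        queued, gaps = queue_delays(4, 0.165, service)
        print(label, [round(q, 3) for q in queued], [round(g, 3) for g in gaps])
```

In this model, when the NIC is faster than the host the gap stays at the
software overhead and nothing queues; when the NIC is slower, the gap
becomes the NIC service time and each further send waits longer.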

True, although it's also possible to use PIO to send the data to the NIC,
flush it and then issue a DMA to send it remotely.  This can be lower
latency than using a DMA from main memory, and it works well for code
where it's important that the source buffer can be re-used after comms
are initialised (shmem_put for example).  Then of course you get people
commenting that it's not a true zero-copy library.

> In this regard, we have a pretty useful test in GASNet called
> testqueue to measure the effect of message gap as the number of sends
> is increased.  Interconnects varied in performance -- QLogic's PIO
> and Quadrics's STEN have a fairly flat profile, whereas Mellanox/VAPI
> was not so flat after 2 messages in flight and my Myrinet results are
> from very old hardware.  Obviously, I'd encourage everyone to run
> their own tests as various HCA revisions will have their own
> profiles.

This is in line with what I would expect.

> I should come up with this test in an MPI form -- GASNet shows these
> metrics with the lower-level software that is used in many MPI
> implementations, so comparing the MPI metrics to the GASNet metrics
> could help identify overheads in MPI implementations.

I'm sure I've seen a benchmark like this before, something that measured
the latency of messages and then saw how much "work" could be done before
latency increased, in effect measuring the CPU overhead of a send.
Quadrics tends to look good when these figures are presented as absolute
numbers and bad when presented as % of latency by virtue of having lower
latency to start with.  I was recently asked to improve the percentage
figure and the best I could come up with was to put a sleep(1) on the
critical path.  If the benchmark I'm thinking of is the GASNet one, could
you change the way it reports results please?
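
To make the reporting problem concrete, here is a minimal sketch of the
arithmetic.  All numbers are hypothetical and chosen only to show the
effect; they are not figures for Quadrics or any other interconnect, and
overhead_report is a made-up helper name.

```python
def overhead_report(latency_us, overhead_us):
    """Return CPU overhead as (absolute usec, percent of latency)."""
    return overhead_us, 100.0 * overhead_us / latency_us


if __name__ == "__main__":
    # Hypothetical interconnects: A has lower latency AND lower overhead.
    a_abs, a_pct = overhead_report(latency_us=1.5, overhead_us=0.5)
    b_abs, b_pct = overhead_report(latency_us=4.0, overhead_us=0.8)
    print("A: %.2f usec, %.1f%% of latency" % (a_abs, a_pct))
    print("B: %.2f usec, %.1f%% of latency" % (b_abs, b_pct))

    # The sleep(1) "fix": add one second (1e6 usec) to A's latency while
    # assuming the CPU overhead itself is unchanged.
    s_abs, s_pct = overhead_report(latency_us=1.5 + 1e6, overhead_us=0.5)
    print("A + sleep(1): %.2f usec, %.5f%% of latency" % (s_abs, s_pct))
```

With these made-up inputs A wins on absolute overhead but loses on the
percentage, and the padded run gets the best percentage of all without
the overhead having changed.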

Ashley,