[Beowulf] Lowered latency with multi-rail IB?

Dow Hurst DPHURST DPHURST at uncg.edu
Thu Mar 26 22:46:46 PDT 2009


To: beowulf at beowulf.org
From: Greg Lindahl <lindahl at pbm.com>
Sent by: beowulf-bounces at beowulf.org
Date: 03/27/2009 12:03AM
Subject: Re: [Beowulf] Lowered latency with multi-rail IB?

On Thu, Mar 26, 2009 at 11:32:23PM -0400, Dow Hurst DPHURST wrote:

> We've got a couple of weeks max to finalize spec'ing a new cluster.  Has 
> anyone knowledge of lowering latency for NAMD by implementing a 
> multi-rail IB solution using MVAPICH or Intel's MPI?

Multi-rail is likely to increase latency.

BTW, Intel MPI usually has higher latency than other MPI
implementations.

If you look around for benchmarks you'll find that QLogic InfiniPath
does quite well on NAMD and friends, compared to that other brand of
InfiniBand adaptor. For example, at

http://www.ks.uiuc.edu/Research/namd/performance.html

the lowest line == best performance is InfiniPath. Those results
aren't the most recent, but I'd bet that the current generation of
adaptors is in the same situation.

-- Greg
(yeah, I used to work for QLogic.)



I'm very familiar with that benchmark page.  ;-)

One motivation for designing an MPI layer that lowers latency via multi-rail is the use of accelerator cards or GPUs.  The GPUs do so much more work per node that the interconnect quickly becomes the limiting factor.  One Tesla GPU is roughly equal to 12 CPU cores in the current implementation of NAMD/CUDA, so scaling efficiency really suffers.  I'd like to see how someone could scale efficiently beyond 16 IB connections with only two GPUs per IB connection when running NAMD/CUDA.
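
To make that arithmetic concrete, here's a rough back-of-envelope sketch in Python.  The 12x figure is the NAMD/CUDA number above; the CPU core count per IB port is an assumption for illustration, not a measurement:

# Back-of-envelope: how much compute sits behind one IB port
# once GPUs enter the picture.
cores_per_gpu_equiv = 12   # NAMD/CUDA figure quoted above
gpus_per_ib_port    = 2    # the configuration in question
cpu_cores_per_port  = 8    # assumed: a typical dual-socket node

gpu_work = gpus_per_ib_port * cores_per_gpu_equiv   # 24 "core-equivalents"
print("work behind one IB port, CPU-only node: %d" % cpu_cores_per_port)
print("work behind one IB port, 2-GPU node:    %d" % gpu_work)
print("load multiplier on the same link:       %.1fx"
      % (float(gpu_work) / cpu_cores_per_port))

Same link, roughly three times the work feeding it, so the network's share of each timestep grows accordingly.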

Some codes are sped up far beyond 12x, reaching 100x, such as VMD's cionize utility.  I don't think that particular code requires parallelization (not sure).  However, as NAMD/CUDA is tuned, efficiency on the GPU will increase and new bottlenecks will be found and fixed in previously ignored sections of code, so the speedup will grow well beyond 12x.  A solution to the interconnect bottleneck therefore needs to be developed, and I wondered whether multi-rail would be the answer.  Thanks so much for your thoughts!
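
To illustrate why tuning the GPU side makes the interconnect loom larger, here is a toy per-timestep model in Python; all the times are assumed, illustrative values, not NAMD measurements:

# Toy model: per-step time = compute/speedup + communication.
# As the GPU speedup grows, the (roughly fixed) communication
# term comes to dominate the step time.
t_compute_cpu = 100.0   # ms of compute per timestep on CPUs (assumed)
t_comm        = 5.0     # ms of communication per timestep (assumed)

for speedup in (1, 12, 50, 100):
    t_step = t_compute_cpu / speedup + t_comm
    comm_share = t_comm / t_step
    print("%4dx GPU speedup: %6.2f ms/step, %3.0f%% of it communication"
          % (speedup, t_step, 100 * comm_share))

At 100x the kernels take 1 ms but the step still costs 6 ms, so over 80% of each step is spent on the wire; that's the part a lower-latency interconnect would have to attack.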
Best wishes,
Dow