[Beowulf] IB in the real world
Josh England
jjengla at sandia.gov
Thu May 12 16:58:44 PDT 2005
On Thu, 2005-05-12 at 14:32 -0700, Bill Broadley wrote:
> I've been looking at the high performance interconnect options.
> As one might expect every vendor sells certain strengths and accuses
> the competition of certain weaknesses. I can't think of a better place
> to discuss these things. The Beowulf list seems mostly vendor neutral,
> er at least peer reviewed, and hopefully some end users actually using
> the technology can provide some real world/end user perspectives.
>
> So questions that come to my mind (but please feel free to add more):
>
> 1. How good is the OpenIB mapper?
That's Myrinet talk. I assume you're talking about the Subnet Manager
(OpenSM)? Short answer: it's good enough, but could certainly use some
improvement.
> It periodically generates static
> routing tables/maps of available IB nodes?
> It's the critical piece
> for handling adding a node or removing a node and keeping a cluster
> functioning? Reliable?
You need an SM for the IB fabric to work, yes. AFAIK, you shouldn't
start seeing any issues with the SM until you get up into the 1000+
node range.
>
> 2. How good is the OpenIB+MPI stack(s)? Any reliable enough for large
> month long jobs?
No MPI has been ported to the OpenIB stack yet. The verbs
implementation was just completed a couple months ago. I believe a few
efforts are currently underway. If you want MPI, you're stuck with a
vendor's IB stack for a little while yet.
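For a flavor of what that verbs layer provides, here is a minimal sketch
of opening an HCA and registering (pinning) a buffer with a
libibverbs-style API. The buffer size is an arbitrary placeholder, and
the exact calls in the OpenIB tree may differ slightly from this:

/* Rough sketch of the verbs calls an MPI implementation sits on top of.
 * Link with -libverbs.  The 1 MB buffer size is just an example. */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **devs;
    struct ibv_context *ctx;
    struct ibv_pd *pd;
    struct ibv_mr *mr;
    size_t len = 1 << 20;               /* 1 MB example buffer */
    void *buf;

    devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }

    ctx = ibv_open_device(devs[0]);     /* open the first HCA */
    pd  = ibv_alloc_pd(ctx);            /* protection domain  */
    buf = malloc(len);

    /* This is the "pinning": the buffer is registered with the HCA and
     * locked in physical memory so the card can DMA to/from it. */
    mr = ibv_reg_mr(pd, buf, len,
                    IBV_ACCESS_LOCAL_WRITE |
                    IBV_ACCESS_REMOTE_READ |
                    IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "memory registration failed\n");
        return 1;
    }
    printf("registered %lu bytes, lkey=0x%x rkey=0x%x\n",
           (unsigned long)len, mr->lkey, mr->rkey);

    /* ...queue pair and completion queue setup would follow here... */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

Queue pairs, completion queues, and connection management all get built
on top of calls like these, which is the work the MPI ports have to do.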
> Which? I've heard rumors of large IB clusters that
> never met the acceptance criteria. FUD or real? Related to IB
> reliability or performance?
>
> 3. How good are the mappers that run inside various managed switches?
> Reliable? Same code base? Better or worse than the OpenIB mapper?
They work.
>
> 4. IB requires pinned memory per node that increases with the total
> node count, true?
Host memory or HCA memory?
This is true, especially for the connection-based protocols. Newer VAPI
implementations have addressed this by implementing a shared receive
queue that the MPI can take advantage of (MVAPICH 0.9.5 does) to reduce
memory consumption. I'm pretty sure OpenIB has shared receive queues as
well. Still, the memory consumption won't hurt too badly until you start
hitting 1000+ processes in a single job.
> In all cases? Exactly what is the formula for memory
> overhead? It is per node? IB card? Per CPU? Is the pinned memory
> optional? What are the performance implications of not having it?
Per connection. Pinned memory is not optional -- you need to open
send/receive queues for transferring data. You can use the UD protocol
to reduce memory consumption, although it is currently slower than RC.
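To make the scaling concrete, here's a back-of-the-envelope sketch.
Every constant in it (buffers per connection, buffer size, SRQ depth) is
an illustrative assumption, not a measured value -- real MPIs tune these:

/* Rough estimate of per-process pinned memory for a fully connected
 * RC job, with and without a shared receive queue (SRQ).
 * All constants are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    int nprocs        = 1024;       /* processes in the job        */
    int bufs_per_conn = 64;         /* pre-posted buffers per conn */
    int buf_bytes     = 8 * 1024;   /* eager buffer size           */
    int srq_depth     = 4096;       /* shared receive queue depth  */
    long long rc_bytes, srq_bytes;

    /* RC: one connection, with its own dedicated buffers, per peer. */
    rc_bytes = (long long)(nprocs - 1) * bufs_per_conn * buf_bytes;

    /* SRQ: receive buffers pooled across all connections. */
    srq_bytes = (long long)srq_depth * buf_bytes;

    printf("per-connection RC: %lld MB pinned per process\n", rc_bytes >> 20);
    printf("with an SRQ      : %lld MB pinned per process\n", srq_bytes >> 20);
    return 0;
}

With those made-up numbers the per-connection scheme pins about half a
gigabyte per process at 1024 ranks, which is why the SRQ matters.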
>
> 5. Routing is static?
Yes. I think some (proprietary?) subnet managers may be capable of
assigning multiple LIDs to each HCA end point and deriving multiple
paths between end points. I don't think OpenSM currently does this. An
ideal solution would be a switch/SM capable of adaptive dispersive
routing.
> Is there flow control? Any handling of hot spots?
> How are trunked lines load balanced (i.e. 6 IB ports used as an uplink
> for a 24 port switch). Load balancing across uplinks? Arbitrary
> topology (rings? tree only? mesh?) Static mapping between downlinks
> and uplinks (no load balancing)? Cut through or store and forward?
> Both? When? Backpressure?
The subnet manager typically does a good enough job of balancing the
routes, but they are still static. The topology can be arbitrary, but
fat trees can provide better performance.
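Taking the 24-port/6-uplink example above as a quick illustration
(assuming 4x SDR links, i.e. 10 Gbit/s signalled and ~8 Gbit/s of data
after 8b/10b encoding):

/* Oversubscription arithmetic for a 24-port switch with 6 uplinks.
 * Assumes 4x SDR links: 10 Gbit/s signalled, ~8 Gbit/s data rate. */
#include <stdio.h>

int main(void)
{
    int ports = 24, uplinks = 6;
    int downlinks = ports - uplinks;        /* 18 */
    double data_gbit = 8.0;                 /* per link, one direction */

    printf("oversubscription: %d:%d = %.1f:1\n",
           downlinks, uplinks, (double)downlinks / uplinks);
    printf("uplink capacity : %.0f Gbit/s\n", uplinks * data_gbit);
    printf("downlink demand : %.0f Gbit/s\n", downlinks * data_gbit);
    return 0;
}

So that configuration is 3:1 oversubscribed; a non-blocking fat tree
would need 12 of the 24 ports as uplinks.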
>
> 6. What real world latencies and bandwidths are you observing on production
> clusters with MPI? How much does that change when all nodes are running
> the latency or bandwidth benchmark?
I don't have the numbers offhand, but I recall roughly 7-9 us latency
and ~800 MB/s on PCI-X and ~1200 MB/s on PCIe.
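If anyone wants to reproduce those numbers on their own fabric, a
bare-bones ping-pong along these lines is enough (just a sketch, nothing
tuned; small messages give latency, large ones bandwidth):

/* Minimal 2-rank MPI ping-pong.  Build with mpicc, run one rank on
 * each of two nodes.  Message sizes/iteration counts are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void pingpong(int rank, int bytes, int iters)
{
    char *buf = malloc(bytes);
    MPI_Status st;
    double t0, t;
    int i;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t = MPI_Wtime() - t0;

    if (rank == 0)
        printf("%8d bytes: %8.2f us one-way, %8.1f MB/s\n",
               bytes, t / iters / 2 * 1e6,
               2.0 * bytes * iters / t / 1e6);
    free(buf);
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pingpong(rank, 1, 10000);        /* latency (1-byte messages)  */
    pingpong(rank, 1 << 20, 200);    /* bandwidth (1 MB messages)  */

    MPI_Finalize();
    return 0;
}

Running the same thing on all node pairs at once is the easiest way to
see how much the numbers move under load, which was the second half of
the question.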
> 7. Using the top500 numbers to measure efficiency, what would be
> a good measure of interconnect efficiency? Specifically RMax/RPeak
> for a given similar size cluster?
>
> 8. Are there more current HPC Challenge numbers than
> http://icl.cs.utk.edu/hpcc/hpcc_results.cgi? Are these benchmark
> results included in all top500 submissions? It seems like a good place
> to measure latency/bandwidth and any relation to cluster size.
>
> 9. Most (all?) IB switches have Mellanox 24*4x chips in them? What is
> the actual switch capacity of the chip? 20GBit*24? Assuming a
> particular clock speed? Do switches run that clock speed? 4x SDR
> per link? DDR?
No DDR yet, but soon.
-JE
>
> I'd be happy to summarize responses, or just track the discussion on
> the list. I'm of course interested in similar for quadrics, Myrinet, and
> any other competitors in the Beowulf interconnect space. Although maybe
> that should be delayed for a week each. Does anyone know of a better
> place to ask such things and get a vendor neutral response (or at least
> responses that are subject to peer review)?
>
> Material sent to me directly, NOT covered by NDA, can be included in
> my summary anonymously by request.
>
> --
> Bill Broadley
> Computational Science and Engineering
> UC Davis
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>