[Beowulf] choosing a high-speed interconnect

Mark Hahn hahn at physics.mcmaster.ca
Tue Oct 12 14:06:41 PDT 2004


> I'm sure posing this may raise more questions than answer but which
> high-speed interconnect would offer the best 'bang for the buck':
> 
> 1) myrinet
> 2) quadrics qsnet
> 3) mellanox infiniband

at least in the last cluster I bought, Myrinet and IB had similar
overall costs and MPI latency.  so far, at least, I haven't found
any users who are bandwidth-limited, so no reason on that front to prefer IB.
(Myri can match the others in bandwidth if you go dual-port; that
approximately doubles the Myri cost, though, making it clearly more 
expensive than IB.)

quadrics is more expensive, but also has much lower latency, and is
competitive with IB in bandwidth.  (there are only three interconnects
that can claim <2 us latency: quadrics elan4, SGI's numalink and the cray
xd1/octigabay.) 

IB vendors swear up and down that they're cheaper than Myri,
lower-latency, higher-bandwidth, and taste great with ice cream.
I must admit to some skepticism in spite of lacking any IB experience ;)
it does seem clear that upcoming PCI-e systems will let IB vendors
drop a few more chips off their nic, and theoretically come down to 
the $200-300/nic range.  as far as I know, switches are staying more or 
less at the same price.  and it's worth remembering that IB still 
doesn't have *that* much field-proof (questions regarding whether IB
will continue to be a sole-source ecosystem, issues of integrating 
with Linux, rumors of sticking points regarding pinned memory, qpair 
scaling in large clusters, handling congestion, etc.)
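
(a word on the pinned-memory point, since it trips people up: any RDMA
interconnect has to register - that is, pin - every buffer the nic will
DMA to or from.  here's a minimal sketch of what that looks like with the
openib verbs API; the buffer size is arbitrary, error handling is stripped
down, and the verbs API itself is still settling, so take it as an
illustration of the cost, not a recipe.

    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* register (pin) a 1 MB buffer so the HCA can DMA into it.  every
       MPI-over-IB implementation has to do this (or cache registrations)
       for each communication buffer, which is where the "pinned memory"
       sticking point comes from.  compile with -libverbs. */
    int main(void)
    {
        struct ibv_device **devs = ibv_get_device_list(NULL);
        if (!devs || !devs[0]) { fprintf(stderr, "no IB devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        size_t len = 1 << 20;          /* arbitrary 1 MB buffer */
        void *buf = malloc(len);

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { perror("ibv_reg_mr"); return 1; }

        printf("pinned %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);

        ibv_dereg_mr(mr);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        free(buf);
        return 0;
    }

the registration is a kernel call per buffer and the pages stay pinned,
so codes that communicate out of many short-lived buffers pay for it.)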

> Currently, our 30-node dual Opteron (MSI K8D Master-FT boards) cluster
> uses Gig/E, and we are looking to upgrade to a faster network. 

why?  how have you evaluated your need for faster networking?
do you know whether by "faster" you mean latency or bandwidth?
offhand, I'd be a little surprised if a 30-node cluster made 
a lot of sense with quadrics, since you're unlikely to *need*
the superior latency.  (i.e., it seems like people jones for low-lat
mainly when they have frequent, large collective operations,
where large means "hundreds" of MPI workers...)
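
the cheapest way to answer that is to measure: run a ping-pong between two
of your nodes over the gig/e you already have, and look at where the time
goes as message size grows.  something like the rough sketch below (plain
MPI, nothing vendor-specific; sizes and iteration counts are arbitrary):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* crude ping-pong between ranks 0 and 1: small messages expose latency,
       large ones expose bandwidth.  run with exactly two ranks, e.g.
       mpirun -np 2 ./pingpong */
    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (size = 1; size <= (1 << 22); size *= 8) {
            char *buf = malloc(size);
            int reps = (size > 65536) ? 50 : 1000;  /* fewer reps for big msgs */
            double t0, t;

            MPI_Barrier(MPI_COMM_WORLD);
            t0 = MPI_Wtime();
            for (i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
                } else if (rank == 1) {
                    MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                    MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            t = (MPI_Wtime() - t0) / (2.0 * reps);  /* one-way time per message */
            if (rank == 0)
                printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                       size, t * 1e6, size / t / 1e6);
            free(buf);
        }
        MPI_Finalize();
        return 0;
    }

the small-message numbers are your latency, the large-message numbers your
bandwidth; if your apps' messages sit at the small end, latency is what an
upgrade has to buy you.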

> As well, what are the components one would need for each setup?  The
> reason I ask is that, for example, the Myrinet switches accept different
> line cards and I am not sure which one to use.  Sorry if this is a bit of
> a newbie question, but I have no experience with any of this kind of
> hardware.  I am reading the docs for each but thought your feedback would
> be good.

hmm, myrinet's pages aren't stunningly clear, but they're also not *that*
hard to follow, since they do describe some sample configs.

for instance, you can see the "small switches" section of
http://www.myrinet.com/myrinet/product_list.html
and notice that it's all based on a single 3U enclosure,
one or two 8-way cards (M3-SW16-8F) and an optional monitoring
card (M3-M).

for a 32-node cluster, you'd need 32 nics, a 5-slot cab, 4x M3-SW16-8F's
(4 cards x 8 ports covers all 32 nodes), either a monitoring card or a
blanking panel, and 32 cables.  if you have fairly firm and short-term
plans for adding more nodes, consider getting a bigger chassis.  if you
have any reason to do IO over myrinet (speed!), consider giving the
fileserver(s) dual-port access...

configuring other networks is not drastically different, though they
often have different terminology, etc.  for instance, quadrics switches
can be configured with "slim" fat-trees (only partially populated with
spine/switching cards, trading bisection bandwidth for cost).
configuration beyond a single switch cab also tends to be interesting ;)
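
(to put a number on "slim": here's a toy back-of-the-envelope - every
figure in it is invented - showing how a half-populated spine scales
bisection bandwidth down along with cost:

    #include <stdio.h>

    /* toy model of a "slim" fat-tree: installing only a fraction of the
       spine cards cuts bisection bandwidth by roughly that fraction.
       link rate, node count and card counts below are made up. */
    int main(void)
    {
        double link_mb_s  = 900.0;   /* per-link bandwidth, MB/s (assumed) */
        int    nodes      = 64;
        int    spine_full = 8;       /* spine cards needed for full bisection */
        int    spine_slim = 4;       /* cards actually installed */

        double full = link_mb_s * nodes / 2.0;
        double slim = full * spine_slim / spine_full;

        printf("full fat-tree bisection:  %.0f MB/s\n", full);
        printf("slim (%d/%d spine cards): %.0f MB/s\n",
               spine_slim, spine_full, slim);
        return 0;
    }

whether that trade is acceptable depends on how much all-to-all traffic
your codes actually generate.)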

regards, mark hahn.