Robert G. Brown
rgb at phy.duke.edu
Fri Nov 24 11:33:33 PST 2000
On Fri, 24 Nov 2000, Steven Berukoff wrote:
> We're putting together a small (~15) node cluster of Alphas (21264 @
> 667MHz) for use in a data analysis application. Our use of the cluster
> requires high performance computational capability (hence the use of the
> Alphas with their high memory capability) but doesn't involve high network
> traffic. Basically, each node will grab a large chunk of data, do FFTs on
> pieces of it, store the results locally, then only occasionally contact
> the master for more.
> So my question: can anyone provide good recommendations for a switch?
> Like I said, high network traffic is not to be a concern, but the cluster
> will be augmented at a later time, to perhaps 64 nodes. Obviously, the
> switch solution should be able to scale appropriately. Are there
> models/manufacturers definitely to avoid? Are there good cost/performance
> Any input is greatly appreciated!
For the scale and kind of operation you describe, it sounds like you are
not likely to be either bandwidth choked or contention choked at the
switch level -- if you are choked anywhere it would be at the point of
connection between your main master/server and the switch, and it sounds
like even that isn't likely to be much of a problem (although you'd have
to provide more detail to be sure, see below).
If these assumptions are correct, you are in the happy position of being
able to buy damn near anything and not having your choice greatly affect
ultimate performance or your ability to scale to 64 nodes.
For example, you can likely consider a stack of three 24-port switches,
with or without gigabit uplinks. If you are choked at the server, you
can put multiple NICs in the server (3, with one port to each switch,
seems like a nice possibility), channel bond several NICs to one switch,
or run a Gbps NIC to one switch.
Alternatively, you can consider a larger switch with a fabric that can
support 64+ ports -- one that is frequently mentioned on this list as a
good price/performance/feature switch is the HP ProCurve switch, which
can be purchased over the counter for $1600-1800 with 40 ports. To go
beyond, I think it was Don Becker who recommended early this year that
one consider buying two 40 port HP ProCurves and putting all 80 ports
and the second power supply in one chassis (they support dual power).
That gives you more than enough ports and dual power for perhaps $3400,
or $42.50/port (plus the cost of the node NIC).
The stackable solutions will cost less than this per port -- you could
even get four 16 port non-stackable switches and plan to put 4 NICs in
your master/server to get to 60 nodes plus the master. 16 port switches
are dirt cheap, and since you don't know when you will get more nodes it
lets you take advantage of the even lower switch prices that will likely
hold if and when you ever need more ports. It is usually a good idea to
spend as little as possible now, as Moore's Law will buy you far more
for far less money later when you eventually need it.
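The port and price arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the master has exactly one NIC on each switch and that the switches are not otherwise linked to each other.

```python
def nodes_supported(num_switches: int, ports_per_switch: int) -> int:
    """Compute nodes reachable when the master has one NIC on each switch."""
    # Assumption: one port on every switch is taken by the master's NIC.
    return num_switches * (ports_per_switch - 1)


def cost_per_port(total_cost: float, total_ports: int) -> float:
    """Switch cost per port, not counting the NIC in each node."""
    return total_cost / total_ports


if __name__ == "__main__":
    # Four 16-port switches with four NICs in the master.
    print(nodes_supported(4, 16))
    # Three 24-port switches with three NICs in the master.
    print(nodes_supported(3, 24))
    # Two 40-port ProCurves in one chassis at roughly $3400 total.
    print(cost_per_port(3400.0, 80))
```

Plugging in current street prices for whatever switches you are considering gives a quick per-port comparison against the ProCurve figure.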
The ProCurve solution (or similar solutions from other vendors) will
provide better scaled symmetric internode communications, but if you are
indeed in a master-slave paradigm the extra cost isn't likely to improve
performance. The only technical questions I can think of that you
should be aware of to fuel the purchase decision are:
a) Try to get a fair idea of the ratio between the time each node
spends computing vs the time each node spends getting the next chunk of
data to work on or returning results. To avoid contention or waiting on
a network resource you need a ratio of AT LEAST N:1 where N is the
number of nodes you expect to use, and you'll only avoid contention with
N:1 if the calculation is perfectly organized to proceed predictably
synchronously. If the ratio is 6400:1 (you spend 6400 seconds computing
and 1 second getting the next set to compute) and you plan to go to 64
nodes, you're pretty safe -- even without a bit of deliberate
antibunching of starting times, the probabilities suggest that the
network will "never" be congested. If it is 16:1, you can't get to 64
nodes at all -- you'll have 3 nodes waiting for the fourth to get its
data, all the time.
b) Try to accurately estimate the number of small-packet transfers of
data required per node per FFT cycle. If it is a large number (and
cannot be reduced by sensibly rewriting your code) then you MIGHT be
performance sensitive to switch latency. In that event you should think
about a higher-end switch -- the cheap switches are store-and-forward
switches with an aggregate latency that is often in the 200 microsecond
range (0.2 milliseconds). This can cost you a lot of performance if you
have to send 500 small packets back and forth to start each FFT AND your
ratio from a) isn't favorable to begin with. Cut-through switches are
much more expensive but also have much lower latency.
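Both estimates in a) and b) are back-of-the-envelope numbers, and a rough Python sketch of them might look like the following. The model is deliberately crude and is an assumption, not a measurement: every node alternates a fixed compute period with a fixed transfer period, all transfers go through the single link to the master, and each small packet pays the full switch latency once. The function names are hypothetical.

```python
def link_demand(num_nodes: int, compute_to_transfer_ratio: float) -> float:
    """Average number of nodes wanting the master's link at any instant.

    Each node is assumed to compute for `ratio` time units, then transfer
    for 1 time unit. Values well below 1 mean the link is mostly idle;
    values above 1 mean nodes are queueing for it.
    """
    return num_nodes / (compute_to_transfer_ratio + 1.0)


def latency_overhead_seconds(num_small_packets: int,
                             switch_latency_s: float) -> float:
    """Time per cycle lost to switch latency alone (one hop per packet)."""
    return num_small_packets * switch_latency_s


if __name__ == "__main__":
    # 6400:1 on 64 nodes: the link is nearly always free.
    print(link_demand(64, 6400.0))
    # 16:1 on 64 nodes: several nodes want the link at once.
    print(link_demand(64, 16.0))
    # 500 small packets through a 200 microsecond store-and-forward switch.
    print(latency_overhead_seconds(500, 200e-6))
```

With a 16:1 ratio the demand comes out well above 1, which is the "3 nodes waiting for the fourth" situation. Whether the latency figure matters depends on how it compares to the compute time per FFT cycle, which you will have to measure on your own code.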
I doubt that manageability is very important to you, although it doesn't
greatly add to the cost of the switches either. The higher end switches
will be manageable and have slightly better latencies and so forth, but
it probably isn't worth it for a coarse-grain non-synchronous
Hope this helps.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu