gigabit switches

Velocet math at
Mon Dec 3 11:39:09 PST 2001

On Mon, Dec 03, 2001 at 07:15:29PM +0100, Steven Berukoff's all...
> Hello all,
> My group is currently outlining the plans for a ~140 node dual athlon

280 cpus?! wow!

> cluster.  Our networking needs are minimal: no internode communication,
> rare master-slave communication (with xfers of ~tens KB), and even more
> rare very large data transfers from slave to server (~100s of MB).
> Now, we are currently considering a hierarchical network structure, to
> minimize costs.  In particular, we planned on having each set of ~15 nodes
> connected to a 16 port 100Mbit switch with a Gbit uplink.  Then, each Gbit
> line gets plugged into a Gbit switch.  

Are you even using NFS here? Sounds like you barely even need a network! :)

100s of MBs of data over 100Mbps is N x 8s of transfer at full wire speed
(call it 10s per 100MB to be safe). It's not a big deal to be busy on the
network for 30s twice a day, or even twice an hour.
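To make the back-of-the-envelope math above concrete, here is a tiny sketch (the `efficiency` knob is my own addition, to model the "say 10s for safety" fudge factor):

```python
# Transfer-time math: at 100 Mbps, 100 MB takes about 8 s at full wire speed.
def xfer_seconds(megabytes, link_mbps=100, efficiency=1.0):
    """Seconds to move `megabytes` over a link of `link_mbps` Mbit/s."""
    megabits = megabytes * 8
    return megabits / (link_mbps * efficiency)

print(xfer_seconds(100))                   # 8.0 s at full speed
print(xfer_seconds(100, efficiency=0.8))   # 10.0 s at 80% efficiency
```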

Even if you are using NFS, why do you need a gigabit uplink with that little
traffic? Why not just give every switch its own port on the fileserver? There
are DEC 4-port ethernet cards that we swear by. Every 15 nodes can have
100Mbps to themselves on the 4-port cards. (Or every 7, see below. 5 cards x
4 ports = 20 ports in the NFS server, no need for a Gbit uplink.)

140/16 = 8.75, so 9 switches, and 9 fileserver ports give 100Mbps to every
group of 16. (Well, technically it's 15 node ports per switch plus 1 uplink,
which makes it 10 switches.)

Or partition the 16-port switches into 2 x 8-port switches (or use cheap
but reliable 8-port switches! we swear by DLINK and SURECOM 8-porters)
and give every 7 nodes 100Mbps to share. That's 14.3Mbps each if they're
all on the network at the same time on average (and an average of 15Mbps
is a lot of data for a node doing disk traffic only - internode
communication is a VERY different thing, of course, which we're not
talking about here).
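The port and switch counts above, worked out in one place:

```python
import math

nodes = 140

# 16-port switches: 15 node ports + 1 uplink each.
switches_16 = math.ceil(nodes / (16 - 1))
print(switches_16)            # 10 switches

# 8-port switches: 7 node ports + 1 uplink each, 100 Mbps shared by 7.
switches_8 = math.ceil(nodes / (8 - 1))
share = 100 / 7
print(switches_8)             # 20 switches
print(f"{share:.1f} Mbps")    # 14.3 Mbps per node, worst case
```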

Is timeliness of communication with the NFS server important? That is, if
something is timestamped for time 'n', does the transfer have to be completed
before time n+m? Or can you stand a bit of lag? Remember, if 7 nodes start
talking to the fileserver at the same moment, they all take a bit longer
to finish transferring.

What we've found is that eventually one node gets ahead of the others, and
they all get staggered even if they start out with the EXACT same access
timing pattern running similar jobs. Once staggered, each gets 100Mbps to
itself when it talks to the NFS server (which they only use for a few dozen
seconds per hour anyway). But this is all pretty irrelevant: even if 1s turns
into 20s of delay for accesses that need to happen every 30 minutes, you're
losing 40s an hour for a cost savings of $x0,000 on gigabit gear. Divide the
cost of the extra Gbit gear into the cost per node and see if you win.

40s/3600s = 1.1%
300s/3600s = 8.3%
600s/3600s = 16.7%
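Those overhead figures are just seconds of network stall per hour as a fraction of the hour:

```python
# Seconds lost per hour -> percentage overhead.
for lost in (40, 300, 600):
    print(f"{lost}s/3600s = {lost / 3600:.1%}")
# 40s/3600s = 1.1%
# 300s/3600s = 8.3%
# 600s/3600s = 16.7%
```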

If you can save enough money by not getting gigabit gear, then you can
buy n% more nodes. If the Gbit gear would only have made you m% faster,
and n > m, then you win. But you do need to know exactly how much network
throughput you are going to use. Find your maximum, test it with a few
live machines, measure it and be sure.
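Here's that n-versus-m trade sketched with made-up numbers (the gear savings and per-node cost below are assumptions for illustration, not quotes):

```python
# Break-even sketch: skipping Gbit gear frees budget for n% more nodes;
# Gbit gear would have avoided m% of wall-clock time lost to the network.
gbit_gear_savings = 40_000   # assumed savings from skipping Gbit (USD)
cost_per_node = 2_000        # assumed cost of one node (USD)
nodes = 140

extra_nodes = gbit_gear_savings // cost_per_node
n = 100 * extra_nodes / nodes   # % more compute from extra nodes
m = 100 * 40 / 3600             # % time lost waiting on the slower network
print(f"n = {n:.1f}%, m = {m:.1f}%, win = {n > m}")
```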

Before we designed our cluster, I ran a bunch of jobs on a node over NFS and
watched how each job used bandwidth. I also ran multiple jobs on multiple
machines on the same network and graphed the bandwidth to see when things
maxed out. In our case it was infrequent, and average bandwidth per node was
2Mbps. With a 100Mbps network you could easily run 15-20 such nodes without
maxing out (and in fact that's what we're doing. I prepared 6 separate
100Mbps networks for the cluster, but we've never broken 70Mbps across all
of our nodes, and that lasted for 15 seconds exactly once. The average for
us when all nodes are loaded is only 30Mbps.) So you DO need to know your
exact network usage profile.
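A minimal sketch of that kind of measurement, assuming a Linux node (`/proc/net/dev` counters) and an interface name of "eth0" (both assumptions on my part, not from the original setup):

```python
# Sample an interface's byte counters twice and report average Mbps.
def read_bytes(iface="eth0"):
    """Total rx+tx bytes for `iface` from /proc/net/dev (Linux only)."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]) + int(fields[8])  # rx bytes + tx bytes
    raise ValueError(f"interface {iface!r} not found")

def mbps(bytes_before, bytes_after, seconds):
    """Average megabits per second between two counter samples."""
    return (bytes_after - bytes_before) * 8 / 1e6 / seconds

# Usage (uncomment on a live node):
# import time
# b0 = read_bytes(); time.sleep(10); b1 = read_bytes()
# print(f"{mbps(b0, b1, 10):.1f} Mbps")
```

Run it while a few live machines execute representative jobs, as suggested above, and graph the samples over time.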

Put more nodes in for that money and suddenly you have a bigger cluster.

Remember, 100Mbps per node is 12.5MB/s (well, we see up to 10-11MB/s with
NFSv3 and Intel eepro FXP cards on FreeBSD). That's pretty good, quite
comparable with the average IDE drive. If you need that much bandwidth
constantly for disk, you aren't doing much calculation, or you could improve
the model - if you need to get that much data on and off disk, you could
probably speed things up majorly just by having 2 or 4 times as much RAM
around for caching on each node (with NFSv3 or real disks). Changing the CPU
speed would change this dynamic (the ratio of CPU speed to network speed) a
large amount as well. So I'm imagining that you don't need major network
bandwidth from what you've described, assuming you communicate over NFS.
(Do you have local disks? Do your nodes need to hammer them for I/O, or
could you use NFS? Swap over NFS isn't all that hard. :)

> The question I have to you, my dear reader, is:  Do there exist 16 or 24
> port Gbit switches suited for this purpose?  The only ones I know of are
> made by Foundry Networks, who manufacture 16, 24, and 32 port
> models.  Have I missed some?  If so, does anyone have yea/nay
> comments?  

Comments? Yeah: empty your pockets into this dollar-sign bag of mine here. :)

Anything exists for a price. I glanced briefly at all this as I marvelled
at the prices:

About $8-10K USD for a 24-port GbE switch and $3.5-4K USD for 12 ports.
They may have dropped prices by now, but I'm sure it's not $100 for
an 8-port GbE switch yet. Make sure you really need it before you spend
the cost of 3-8 nodes on every 12 ports.
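Per-port, those prices work out roughly like this (the $1,200 node cost below is my own illustrative assumption; plug in your real quote):

```python
# Rough 2001-era per-port cost math for a 12-port GbE switch.
switch_cost_12_port = 4_000          # upper end of the quoted $3.5-4K range
cost_per_port = switch_cost_12_port / 12
node_cost = 1_200                    # assumed cost of one node (USD)
nodes_foregone = switch_cost_12_port / node_cost
print(f"${cost_per_port:.0f}/port, ~{nodes_foregone:.1f} nodes per 12-port switch")
```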

> Your comments and suggestions are, as always, very appreciated.

Don't believe that just because you are building a 'supercomputer' out of
commodity parts, you need the most expensive commodity setup. You are
still allowed to save money even if you didn't buy a Fujitsu monster.

(Can you buy more nodes, or is 140 the set number?)


> Steve
> =====
> Steve Berukoff					tel: 49-331-5677233
> Albert-Einstein-Institute			fax: 49-331-5677298
> Am Muehlenberg 1, D14477 Golm, Germany		email:steveb at
> _______________________________________________
> Beowulf mailing list, Beowulf at
> To change your subscription (digest mode or unsubscribe) visit

Ken Chase, math at  *  Velocet Communications Inc.  *  Toronto, CANADA 
