[Beowulf] 512 nodes Myrinet cluster Challanges

Vincent Diepeveen diep at xs4all.nl
Mon May 1 13:37:49 PDT 2006


With so many nodes I'd go for either InfiniBand or Quadrics, assuming the
largest partition also gets 512 nodes.

Both scale much better at that node count. Your software will need a lot of
communication, as you'll probably need quite a lot of RAM for the
applications on all nodes.

Of course most vendors want to sell you Myrinet as it's simply cheaper; they
might earn more on it per node.

For this type of code, the network you use and the total amount of RAM are
the two most important choices.

You could consider putting two network cards in each node, assuming each node
is quite big, in order to dedicate the high-end network entirely to the RAM
communication.

As I/O already has quite high latency, for I/O you could make do with a
high-bandwidth network with poor latency, and use a real state-of-the-art
high-end network for the memory communication.

The problems you can expect depend largely on the number of users who are
going to use your cluster simultaneously.

More users = more problems.

Just avoid all that commercial software for putting nodes to work that most
manufacturers try to sell you.
In my experience PDSH works pretty well for starting work on the nodes.
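
As a rough sketch of what starting work with pdsh can look like from a small
Python wrapper: the host names (node001 to node512), the fanout of 64 and the
helper name build_pdsh_command are assumptions made up for illustration, not
anything from a real setup. The -w (target host list) and -f (fanout) options
are standard pdsh options.

```python
import shlex


def build_pdsh_command(prefix, first, last, remote_cmd, fanout=64, width=3):
    """Build an argv list that runs remote_cmd on prefix[first-last] via pdsh.

    The host naming scheme (e.g. node001..node512) is hypothetical.
    """
    if first > last:
        raise ValueError("first must not exceed last")
    # pdsh accepts bracketed host ranges such as node[001-512]
    hostlist = f"{prefix}[{first:0{width}d}-{last:0{width}d}]"
    return ["pdsh", "-w", hostlist, "-f", str(fanout), remote_cmd]


cmd = build_pdsh_command("node", 1, 512, "uptime")
# Print a shell-quoted form of the command for inspection
print(shlex.join(cmd))
```

On a head node where pdsh is installed, the list can be handed to
subprocess.run(cmd) as it is.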

Does your software handle dying nodes, and can the network hot-swap them?

If not, just consider the odds that sometimes a node needs maintenance.
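
To put a number on those odds, here is a minimal sketch. It assumes node
failures are independent and that each node is available 99.9% of the time;
both are illustrative assumptions, not measurements from any real cluster.

```python
def prob_any_node_down(nodes, per_node_availability):
    """Probability that at least one of `nodes` independent nodes is down."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    if not 0.0 <= per_node_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    # All nodes are up with probability availability ** nodes
    return 1.0 - per_node_availability ** nodes


# 0.999 is an assumed per-node availability, chosen only for illustration
for n in (32, 128, 512):
    print(n, round(prob_any_node_down(n, 0.999), 3))
```

Under those assumptions a 512-node partition has at least one node down
about 40% of the time, so a job that cannot survive a missing node will hit
this regularly.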

How do you want to divide the cluster: into one partition of 512 nodes, or do
you plan all kinds of small partitions?

A network is of course more expensive when you have one huge cluster than when
you divide it into small partitions.
If a node dies, then with several small partitions your other partitions
keep running without problems. Only the partition with the dying node has a
problem.
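
A small sketch of that trade-off, assuming equally sized partitions and a job
that occupies its whole partition and stops when any one of its nodes dies
(both assumptions are mine, for illustration):

```python
def nodes_stalled_by_one_failure(total_nodes, partitions):
    """Nodes whose work stops when a single node dies.

    Assumes equal partitions and that a partition stalls as a whole.
    """
    if partitions <= 0 or total_nodes % partitions != 0:
        raise ValueError("partitions must divide total_nodes evenly")
    return total_nodes // partitions


for parts in (1, 4, 16):
    stalled = nodes_stalled_by_one_failure(512, parts)
    print(f"{parts} partition(s): {stalled} nodes stalled, "
          f"{512 - stalled} keep running")
```

With one 512-node partition a single dead node stalls everything; with 16
partitions of 32 nodes the other 480 nodes carry on.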

Most likely that dying node just has some dust inside its PSU :)

Vincent

----- Original Message ----- 
From: "Walid" <walid.shaari at gmail.com>
To: <beowulf at beowulf.org>
Sent: Wednesday, April 26, 2006 11:34 AM
Subject: [Beowulf] 512 nodes Myrinet cluster Challanges


> Hi all,
>
> Does anyone know what types of problems/challenges to expect with big
> clusters?
>
> we are considering having a 512 node cluster that will be using
> Myrinet as its main interconnect, and would like to do our homework
>
> The cluster is meant to run an in-house fluid simulation application
> that is I/O intensive, and requires large memory models.
>
> Any hints or pointers will be appreciated.
>
> TIA
>
> Walid.