[Beowulf] Re: Why Do Clusters Suck?

David Mathog mathog at mendel.bio.caltech.edu
Tue Mar 22 14:12:05 PST 2005

> On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> > 
> > More to the point, while there is certainly a lot
> > of room for improvement, an awful lot of work is
> > getting done today using existing cluster technology
> > and it's far from clear to me that an advance
> > in cluster management software would result in much more
> > productivity.  As opposed to, for instance, improving
> > network throughput, CPU power, or component reliability by
> > a factor of 10, any one of which would lead to an immediate
> > and dramatic productivity increase.
> > 
> Would it?

Yes.  Programs tend to be CPU limited, bandwidth limited, or
both.  If you improve the limiting component the program
will speed up to the point that something else becomes the new
bottleneck.  For most of our work now the CPU or memory bandwidth
is limiting, but for some operations (data distribution) the
network bandwidth is.
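
To make the "new bottleneck" point concrete, here is a toy sketch.
It assumes the stages of a job overlap fully, so the slowest one sets
the run time.  The stage names and timings are made up for
illustration, not measurements from our cluster, and run_time and
speed_up are hypothetical helpers.

```python
# Toy model: a job's run time is set by its slowest stage.
# All numbers below are invented for illustration.

def run_time(stage_seconds):
    """Return (limiting stage, seconds), assuming stages overlap fully."""
    name = max(stage_seconds, key=stage_seconds.get)
    return name, stage_seconds[name]

def speed_up(stage_seconds, stage, factor):
    """Return a copy of the timings with one stage made `factor` times faster."""
    faster = dict(stage_seconds)
    faster[stage] = faster[stage] / factor
    return faster

baseline = {"cpu": 100.0, "memory": 60.0, "network": 20.0}

print(run_time(baseline))                       # the CPU is the limit here
print(run_time(speed_up(baseline, "cpu", 10)))  # now memory is the limit
```

In this made-up case a 10x faster CPU takes the job from 100 to 60
seconds, at which point memory bandwidth is what you would have to
improve next.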

> Myrinet or IB is more than enough bandwidth for
> us

Ok.  Now imagine what would happen if you dropped back to 100baseT,
which is what I'm still using.

> (weather and ocean modes, nearest neighbor communications), we prefer
> better latency.

> We have over a thousand nodes and hardware
> reliability has never significantly impacted our users and their
> productivity. 

We've lost up to 2 of our 20 nodes at a time.  Most of our
tasks depend upon particular data set slices being distributed
across the nodes. When one node goes down it takes several
hours to redistribute the data appropriately among the
remaining nodes.  If I had 1000 nodes this would become
enough of a problem that I'd have to redo the data distribution
method and build in something resembling RAID-like redundancy.
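
A minimal sketch of what I mean by redundancy, assuming plain
replication (each slice stored on more than one node) rather than
parity.  This is not what we run today; place_slices, surviving_copy
and the node names are hypothetical.

```python
def place_slices(num_slices, nodes, copies=2):
    """Map each slice index to `copies` distinct nodes, round-robin."""
    if copies > len(nodes):
        raise ValueError("need at least as many nodes as copies")
    return {
        s: [nodes[(s + c) % len(nodes)] for c in range(copies)]
        for s in range(num_slices)
    }

def surviving_copy(placement, failed):
    """For each slice, pick a holder that is still up (None if all are down)."""
    return {
        s: next((n for n in holders if n not in failed), None)
        for s, holders in placement.items()
    }

nodes = [f"node{i:02d}" for i in range(20)]
placement = place_slices(40, nodes, copies=2)

# With two copies, any single node failure leaves every slice reachable.
after_failure = surviving_copy(placement, failed={"node03"})
print(all(n is not None for n in after_failure.values()))
```

With copies=2 a job can carry on after one node dies without waiting
for a redistribution, at the cost of twice the disk space.  Losing two
neighboring nodes at once would still lose a slice in this layout, so
covering that case would need copies=3 or a smarter placement.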

> Our biggest problem is the immaturity of development
> tools. 

I feel your pain on that one.

> It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
> checking?".

Egads!  I'm a big fan of building and testing programs on as many
completely different platforms as possible, and with every
possible warning enabled.  That does wonders for wringing latent
bugs out of code.


David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
