[Beowulf] Re: Why Do Clusters Suck?

Craig Tierney ctierney at HPTI.com
Tue Mar 22 13:26:06 PST 2005


On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> On Mon, 21 Mar 2005 Douglas Eadline wrote:
> > 
> > So why do clusters suck?
> 
> Would that they did suck - it would make cooling them a lot
> easier.  Unfortunately while most of them blow pretty well their
> sucking sucks.
> 
> Which has nothing to do with your original post, assuming either of
> these messages actually make it through anybody's spam filters.
> 
> More to the point, while there is certainly a lot
> of room for improvement, an awful lot of work is
> getting done today using existing cluster technology
> and it's far from clear to me that an advance
> in cluster management software would result in much more
> productivity.  As opposed to, for instance, improving
> network throughput, CPU power, or component reliability by
> a factor of 10, any one of which would lead to an immediate
> and dramatic productivity increase.
> 

Would it?  As far as hardware specs go, it depends on your needs.
Myrinet or IB is more than enough bandwidth for us (weather and ocean
models, nearest neighbor communications); we would rather have better
latency.  CPU power is nice, but you can use the same CPUs in a
cluster that you can use in big iron; you just have to pay more.  That
isn't a cluster issue.  We have over a thousand nodes and hardware
reliability has never significantly impacted our users and their
productivity.  Some HA setups might help with filesystems and admin
servers, but we are already at > 99.9% availability on hardware that is
a single point of failure in the cluster without HA.

Our biggest problem is the immaturity of development
tools.  Another way to put that is "my compiler doesn't reproduce
the bugs in the other compilers my users are accustomed to using"
or "Fortran isn't a standard, it is a suggestion".  It is a rare creature
that writes clean, portable code.  It is all too common to hear
developers ask me things like "does it work if you turn off bounds
checking?"  I spend way too much time with new users trying to explain
to them the difference between 'code porting' and 'bug fixing'.


Craig

> Regards, 
> 
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf



