[Beowulf] The True Cost of HPC Cluster Ownership

Tue Aug 11 11:46:05 PDT 2009

Joe Landman <landman at scalableinformatics.com> wrote:

> I am arguing for commodity systems.  But some gear is just plain junk. 
> Not all switches are created equal.  Some inexpensive switches do a far 
> better job than some of the expensive ones.  Some brand name machines 
> are wholly inappropriate as compute nodes, yet they are used.
> 
> A big part of this process is making reasonable selections. 
> Understanding the issues with all of these, understanding the interplay.

A lot of this issue boils down to a lack of available information, or
perhaps to the cost of obtaining that information.

Consider for instance the switches you cited above.  How is the average
site going to decide which switch is better before purchase?  On paper,
going by the published specs, they will often look identical.  The two
companies may be equally reputable.  Still, one device may be a piece of
junk and the other best in class.  On very rare occasions there will be
an independent review available.  Only a large site is likely to have
the resources to obtain samples of each switch and test them
extensively.  The best most of us can do is ask around whether "switch
XYZ is OK" before making the leap.

With compute nodes, performance information is more readily available,
often in reviews, but again, reliability information rarely is.  And we
have all seen models which crunch nicely but have innate reliability
problems that don't turn up in a 3-day review, and then bite hard during
continuous use.  Again, large sites can obtain a test unit and beat on
it for a few months, but small sites usually cannot.  At least in this
case knowledge does build up over time in the community, so if one waits
for a machine to be in the field for a year, it may be possible to ask
around and find out if it is a good idea to buy some.  (But don't wait
too long; the sales life of a computer model is not very long!)  For
this reason, unless a site is very well funded, buying cutting edge
compute nodes is a rather large gamble.

If the resources to run these tests aren't present in house, one may
essentially buy the expertise by paying enough to a reputable company to
run the tests.  Either way, knowing costs money.

Ideally there would be accepted standards for testing performance and
reliability of each class of equipment, and the manufacturers would run
these tests themselves, or farm them out to neutral entities, and then
publish this information.  It would certainly be a compelling sales
tool, at least from my perspective.  In practice, it usually seems like
the manufacturers spend more time hiding equipment defects than they do
proving and publishing its strengths.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech