COTS was Re: [Beowulf] 96 Processors Under Your Desktop

Jim Lux james.p.lux at jpl.nasa.gov
Wed Sep 1 12:41:24 PDT 2004


rgb wrote, in response to Joachim:
> To build a cluster we order "standard nodes", selecting a hardware
> configuration from what the current crop of commodity choices permit.
> It is delivered.  We rack 'em up.  We boot 'em and they PXE/kickstart
> installs themselves, then yum-maintain themselves.
>
> Although this requires time and expertise to set up, it scales to an
> arbitrary number of nodes with little additional work.  We use the nodes
> until they break or age out to obsolescence and are retired, fixing the
> hardware as long as it makes economic sense to do so.
>

"requires time and expertise to set up" is, of course, what makes a cluster
(as a completed system) not COTS, even though the components or
subassemblies may be COTS.

What Orion is doing is a custom design of the assemblies and the software,
so that the finished product (the cluster) is a COTS device.  They're
moving the "COTS boundary" to a higher level of integration. Or, at least,
that's what they hope to do.

So, as Joachim pointed out, you can get better performance, or less
hassle, per unit cost (on a recurring-cost basis) with a customized design
of the "system".

Even so, using off-the-shelf standardized components or subassemblies can
reduce some aspects of the cost/hassle of the system: I doubt any beowulf
builders doing onesie/twosie or even dozen quantities have any desire to
lay out motherboards and arrange for manufacturing.  Even fewer are going
to go to customized CPU ASICs (although, consider something like
Myrinet... they developed custom ICs to meet their requirements, since
there was no COTS part available).

Cluster computing, at least in the beginning, was the idea of leveraging the
huge volumes of consumer products to drive the NRE (non-recurring
engineering) costs as low as possible.  The recurring costs in a full
custom design (at whatever level of integration) will probably be lower
than the recurring costs in a COTS design of the exact same performance,
just because you can make that cost one of the design criteria.

Without knowing anything about how Orion's financial structure works or what
their business model might be, it looks like they're betting on selling
enough "clusters in a box" to recoup the fairly substantial development
costs for the first unit out the door.  They've also targeted a market where
they won't get cannibalized and undercut by enthusiastic free labor
assembling systems bought on eBay onto bakery racks.  Looking over the
datasheets for the Orion box, I doubt that you could duplicate it for $10K.
You could probably duplicate the performance (in a flops sense), but it
would be bigger, probably consume more power, and you'd be spending a fair
number of hours on the task.
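
To put rough numbers on the "recoup the development costs" bet, here is a
minimal sketch of the break-even arithmetic. Only the $10K price point comes
from the discussion above; the NRE figure, the per-unit build cost, and the
function name are made up purely for illustration, since I don't know
Orion's actual costs.

```python
import math


def break_even_units(nre_cost, unit_price, unit_build_cost):
    """Return the number of units that must be sold before the per-unit
    margin has paid back the one-time development (NRE) cost.

    Returns None if there is no positive margin, i.e. the NRE is never
    recovered no matter how many units are sold.
    """
    margin = unit_price - unit_build_cost
    if margin <= 0:
        return None
    return math.ceil(nre_cost / margin)


# Hypothetical figures: $5M of NRE, a $10K selling price, and a $6K
# recurring cost to build each box. Only the $10K is from the thread.
units = break_even_units(nre_cost=5_000_000,
                         unit_price=10_000,
                         unit_build_cost=6_000)
print(units)
```

The point of the sketch is only that the break-even volume is the NRE
divided by the per-unit margin, so a lower recurring cost (the thing a
custom design lets you optimize) shrinks the volume needed directly.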

The real challenge for Orion (or any other COTS cluster vendor) is whether
they can get the software to require the same minimal support that desktop
software currently requires. The hardware is a fairly known quantity, and
getting sufficient reliability and diagnostics in a ground-up design is
straightforward (since, basically, it's the same hardware as everyone else
is using, at the bottom levels).  The same is not true of cluster software,
which is hardly distributed on a scale comparable to, say, Matlab, Spice,
HFSS, or, dare I say, MS Office.  If they are targeting "cluster
unsophisticated" users, those users probably aren't going to want to
understand the subtleties of interconnect latencies, zero copy network
stacks, MPI vs PVM, backplane bandwidth, etc.  They're going to just want
their user application to run faster with minimal attention to the
"multiprocessor-ness" of it.  The advantage they have is that these users
are also not going to worry about eking out the last FLOP of performance.
As long as the application runs faster on the Orion box than some other box
they could have spent $10K on (or $100K), they'll be happy.






