Question about clusters

Fri Feb 7 12:39:06 PST 2003

On Fri, Feb 07, 2003 at 03:11:17PM -0500, Robert G. Brown's all...
> On Fri, 7 Feb 2003, Ken Chase wrote:
> 
> > On Fri, Feb 07, 2003 at 07:25:31PM +0100, KNT's all...
> > > I'm only interested in calculating it theoretically (simply: on paper, no
> > > computers used), because I need it to estimate the power of a
> > > non-existent cluster. I thought that the difference between
> > > 'theoretical' and 'practical' would be obvious. My mistake.
> > > 
> > > About 'power': I don't know an appropriate word in English for "computer
> > > mathematical calculation ability" ;).
> > 
> > 'Fastest speed for single job execution' and 'job throughput per
> > month' call for quite different cluster configurations for the
> > money.
> > 
> > Realistically I believe most people require the latter, but for some reason
> > (bragging rights? impressing lay people?) the former is always sought
> > after. Anyone care to comment?
> 
> My only modifications of this are that you might add "per dollar spent"
> and refer to "total work done" rather than number of jobs per se.  Some
> people might run only a single job per month, but be very interested in
> the size or total amount of work that job could get done by some other
> measure.

As I said - 'for the money'. It's all about money, which is my primary critique
of almost all cluster models discussed on the list here. Few people ever
mention the actual per-dollar cost, _AS IF_ they were working for the nuke
stewardship project (well, some of them are.. but which? :), the DEA or
Colombian drug cartels (c'mon, fess up, youz!).

> Without the connection of money and cost benefit analysis, of course one
> gets the fastest possible nodes and so forth.  It's only when one looks
> at the amount of work one can get done for your fixed budget that one
> suddenly realizes that one can buy a very nice complete 2.4 GHz P4
> compute node for what one pays for a 3.0 GHz P4 CPU alone.

Or you can get n nodes with Myrinet or some other HSI (high speed
interconnect), or you can get 2n or more nodes without HSI. In our case we got
2.8n in our own quote, and against a competing quote for a cluster we offered
4x as many nodes as their HSI proposal; they were proposing all HA gear. I
suppose you have to do that if you have so few nodes :).

> CBA is the key to happy cluster design.  Spend your money getting what
> you need to get the most work done in the least amount of time, for your
> budget.

Now you're going to talk about power and cooling to run these things. Gosh,
what are you, some sort of unapologetic REALIST?! BEGONE from our list! :)

> Unless, of course, you are backed by the full buying power of the U.S.
> Government, in which case you get whatever you damn well please...:-)

See above.

It's not just a CBA, it's designing a cluster for what it's going to be used
for. If you design it for one person to get their single job done in
1 hour, it's only going to be good at that. n nodes get the single job
done in time t. If you have 2 jobs, it's time 2t. 10 jobs, 10t.

What if you managed to get 3n nodes for the same money by avoiding HSIs and
going GBE? Now your jobs run in time 1.5t because it doesn't scale as well. To
run 10 jobs takes 10 x 1.5t = 15t on n nodes. But you have 3n nodes
for the money, so you run 3 jobs at a time and the 10 jobs take about
15t/3 = 5t. Your cluster has twice the 'throughput' of the HSI cluster. It's
not 'as fast' at single jobs, but it sure gets things done quick when more
than 1 job is running (in this case 10 jobs was my baseline, running 3 at a
time).
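
A minimal sketch of the arithmetic above, in Python. The function names are
made up for illustration, and it assumes all 10 jobs are the same size and
that each GBE job occupies n of the 3n nodes. It prints the HSI time, the
averaged GBE time, and the GBE time when jobs are counted in whole rounds of 3.

```python
import math

def makespan(jobs, job_time, concurrent):
    """Wall-clock time to finish `jobs` equal jobs when `concurrent` run at once."""
    rounds = math.ceil(jobs / concurrent)
    return rounds * job_time

def average_time(jobs, job_time, concurrent):
    """Steady-state estimate: total job time divided by concurrency (no rounding)."""
    return jobs * job_time / concurrent

t = 1.0  # time for one job on n HSI nodes (arbitrary unit)

hsi = makespan(10, t, 1)                 # n nodes, one job at a time
gbe_avg = average_time(10, 1.5 * t, 3)   # 3n nodes, three jobs at a time
gbe_batch = makespan(10, 1.5 * t, 3)     # same, counting the partly empty last round

print(hsi, gbe_avg, gbe_batch)  # 10.0 5.0 6.0
```

Counted in whole rounds, 10 jobs do not divide evenly into batches of 3, so
the GBE cluster needs 4 rounds (6t) rather than the averaged 5t. Either way it
finishes well ahead of the 10t on the HSI cluster.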

Obviously the pathological case is to not get ANY HSI and put all the money
into nodes. Run all jobs on single nodes. This gives 100% efficiency
across the whole cluster, but managing the jobs is a pain in the ass.

However, gigabit is around and it's 1/10th the cost per node of HSI. It doesn't
scale as well (ironically mitigating the need for huge fabric GBE switches -
you can stick to cheap 12 and 16 port GBE, if you need switches at all), but
it can scale (depending on what you're doing) relatively usefully for small
numbers of nodes. So on a 100 node cluster you might find it a pain to run 100
jobs and manage them all for the sake of 'efficiency and throughput', but if
you add GBE to the equation, you've increased the cost by 10% but you can now
run things on several nodes at a time (again, depending on how parallelizable
what you're running is and how sensitive it is to latency). So now you manage
8 or 12 or 15 jobs at a time, not 100. And you have more total throughput
than if you had bought HSI gear.
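
A rough sketch of what those cost ratios do to node count on a fixed budget.
The node price and the budget are hypothetical numbers picked for
illustration; only the ratios (GBE adds 10% per node, HSI costs 10x GBE per
node) come from the text above, and switch costs are ignored.

```python
def nodes_for_budget(budget, node_cost, interconnect_cost):
    """Whole nodes affordable when each node also pays for its interconnect."""
    return budget // (node_cost + interconnect_cost)

NODE = 1000            # hypothetical price of one bare node
GBE = NODE // 10       # "increased the cost by 10%"
HSI = 10 * GBE         # GBE is "1/10th the cost per node of HSI"
BUDGET = 100 * NODE    # hypothetical budget: enough for 100 bare nodes

bare = nodes_for_budget(BUDGET, NODE, 0)
gbe = nodes_for_budget(BUDGET, NODE, GBE)
hsi = nodes_for_budget(BUDGET, NODE, HSI)

print(bare, gbe, hsi)  # 100 90 50
```

Under these assumed ratios, HSI halves the node count relative to bare nodes,
while GBE gives up only 10 nodes out of 100.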

There may of course be some cutoff values above which HSI performs better - you
may see no slowdown at all on HSI gear with your particular computational model,
but pathologically bad scaling on GBE. It really depends on your computational
algorithms and amount of message passing, as always. However, for most things,
HSI is subject to the same scaling concerns as GBE; it's not a panacea. It's a
question of how much you want to pay to be able to get 50% total scaling
efficiency running on 16 or 20 nodes instead of merely 8.
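
To make the "50% at 16 nodes instead of 8" tradeoff concrete, here is a toy
model in Python. The efficiency formula and both overhead constants are
assumptions invented for this sketch (they are not measurements of any real
interconnect or code); they are chosen only so the two curves cross 50% at 8
and 16 nodes.

```python
from fractions import Fraction

def efficiency(p, c):
    """Toy model: parallel efficiency on p nodes with per-extra-node overhead c."""
    return 1 / (1 + c * (p - 1))

def max_nodes_at(target, c, limit=1024):
    """Largest node count (up to limit) whose efficiency is still >= target."""
    best = 1
    for p in range(1, limit + 1):
        if efficiency(p, c) >= target:
            best = p
    return best

gbe_c = Fraction(1, 7)    # made-up overhead: 50% efficiency at 8 nodes
hsi_c = Fraction(1, 15)   # made-up overhead: 50% efficiency at 16 nodes
half = Fraction(1, 2)

print(max_nodes_at(half, gbe_c), max_nodes_at(half, hsi_c))  # 8 16
```

Real codes will have their own curves; the point is only that both
interconnects hit the same wall, just at different node counts.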

===== 

I'm curious: when people see really poor scaling on their clusters (HSI
or GBE or 100BT, doesn't matter) at like 16 or 32 or more nodes (I'm thinking
CHARMM and Gromacs here), what do you do with the extra CPU? Just let it
float away unused? Do you use it? Do you run other jobs on those nodes at
the same time? Do you nice those jobs to 19? Do you see your cache being
thrashed by this, and the scaling characteristics of your mesh
degrading very steeply?

/kc

>    rgb
> 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA