mladwig at comcast.net
Sun May 26 17:07:45 PDT 2002
First, thanks for the detailed response. I'm continuing to work through it,
but here is some additional information.
On Sunday 26 May 2002 12:53, Robert G. Brown wrote:
> To see how
> far you can scale it, you need to estimate or measure some more numbers,
> in particular how long your IPC's will take relative to your
To the system, the small blocks are all precalculated "constants" (the 1.5M
component is the variable), and they will rarely change. The problem lies in
the order in which the constants are applied to the variable before you can
be confident that you have an answer for a work unit.
Right now, I know the sizes of info travelling through the system, and how
long the core calculation takes; I don't know yet how the components of the
parallelized design will perform or how they interact. I think I can only
determine them statistically through experimentation once I build it, because
they will vary wildly depending on the variable being processed.
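Robert's scaling question can at least be framed as a back-of-the-envelope calculation. This sketch is a simplified model, not a measurement. It assumes the master handles dispatches strictly one at a time, that each dispatch costs a fixed serial time (locate, read, pack) plus the payload size divided by the master's network bandwidth, and that each slave computes for a fixed time per work unit. The function name, parameters and example figures are all hypothetical.

```python
import math


def max_useful_slaves(t_compute, t_master_serial, payload_bytes, bandwidth_bytes_per_s):
    """Estimate how many slaves one master can keep busy (hypothetical model).

    Assumes the master performs one dispatch at a time, each costing
    t_master_serial seconds of serial work plus the time to push
    payload_bytes through its network link, and that a slave computes
    for t_compute seconds per work unit.
    """
    t_dispatch = t_master_serial + payload_bytes / bandwidth_bytes_per_s
    if t_dispatch <= 0:
        raise ValueError("dispatch time must be positive")
    # While one slave is busy for (t_dispatch + t_compute), the master can
    # service at most that many dispatch slots; beyond this, extra slaves idle.
    return math.floor((t_compute + t_dispatch) / t_dispatch)


# Made-up example figures, not measurements.
print(max_useful_slaves(t_compute=60.0, t_master_serial=0.05,
                        payload_bytes=1.5e6, bandwidth_bytes_per_s=12.5e6))
```

With these invented numbers (60 s of compute, 50 ms of serial master work, a 1.5 MB payload over roughly 100 Mbit/s Ethernet), one master would saturate at about 353 slaves. Because the real compute time varies with the variable being processed, the inputs would have to come from measured distributions rather than single values.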
> Now, I have no idea how many slaves you can afford or the relative value
> of your result. 100 cpu clusters aren't THAT expensive. Looking back
At this stage of the work, I can probably justify purchasing enough nodes to
prove how many nodes I need to really solve the problem :-). I'm trying to
get to that first step. In the long run, if the problem can be solved,
sufficient nodes will be purchased.
> c) Go to a multiple masters parallelization -- arrange a sort of a
> tree of masters, each with 200 slaves, themselves communicating back
> with a supermaster.
Hmmm. In some ways, the problem lends itself to this, but determining the
optimum number of slaves per master is difficult, as it can vary continuously
depending on the situation. Because of this variability, I would ideally be
able to reassign these resources dynamically, and that would in turn
complicate resource prelocation (mentioned below).
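If the tree-of-masters route is taken, dynamic reassignment could start as simply as having the supermaster periodically re-split a fixed slave pool according to how much work each sub-master has queued. This is only a sketch under that assumption: queue length stands in for the real (and so far unknown) cost of each master's work, and the function and master names are hypothetical. Largest-remainder apportionment is used here simply so the shares always sum to the pool size.

```python
def allocate_slaves(total_slaves, pending_units):
    """Split a pool of slaves across sub-masters in proportion to pending work.

    pending_units maps a (hypothetical) master ID to the integer number of
    work units queued at that master. Uses largest-remainder apportionment
    so the shares always add up to total_slaves when any work is pending.
    """
    alloc = {m: 0 for m in pending_units}
    active = [m for m, w in pending_units.items() if w > 0]
    total = sum(pending_units[m] for m in active)
    if total_slaves <= 0 or total == 0:
        return alloc

    remainders = {}
    for m in active:
        # Integer arithmetic keeps the split exact (no float rounding).
        share, rem = divmod(total_slaves * pending_units[m], total)
        alloc[m] = share
        remainders[m] = rem

    leftover = total_slaves - sum(alloc.values())
    # Hand the leftover slaves to the masters with the largest remainders;
    # ties are broken by master ID so the result is deterministic.
    for m in sorted(active, key=lambda m: (-remainders[m], m))[:leftover]:
        alloc[m] += 1
    return alloc


# Illustrative figures only.
print(allocate_slaves(600, {"m1": 1, "m2": 1, "m3": 1}))
print(allocate_slaves(10, {"a": 3, "b": 1, "c": 0}))
```

Moving a slave from one master to another would also mean reloading its constants, so rebalancing too often could cost more than it saves; the interval between reallocations would itself need tuning by experiment.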
> cluster (plus a master) isn't particularly expensive -- with rackmount
> dual Athlons I'd guesstimate less about $1000-1100 per cpu, not much
This is a good time to ask another question. The calculation is highly
optimized for the P4; does anyone have 1U-style P4 nodes that they like in
> In others it isn't -- each slave gets a unique set of numbers being
> (say) read from a big file on the master's disk (which, by the way,
> might take a nontrivial amount of time to locate on disk, read into a
> buffer, and arrange to send to the next free slave, hence the need to
> estimate the total amount of SERIAL work done by the master to send an
> IPC packet off to the next free slave!).
An optimization would be to prelocate the constants to the node(s) that will
be using them, keep them in memory or on local disk, and not need to transfer
them while a work unit is ongoing. When I've looked at some of the batch
processing tools (e.g. SGE - thanks for the suggestion Rayson), I don't see a
way to support this kind of resource "affinity". Did I miss something?
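Short of a scheduler feature for this, a home-grown master could approximate affinity by remembering which constant blocks each node already holds and steering each work unit to the free node that needs the least shipped to it. The sketch assumes constants are identified by hashable IDs and that a node keeps whatever it has been sent. The `Node` class and `dispatch` function are hypothetical and not part of any batch system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    name: str
    # IDs of constant blocks already held in memory or on local disk.
    cached: Set[str] = field(default_factory=set)
    busy: bool = False


def dispatch(nodes: List[Node], needed: Set[str]) -> Tuple[Optional[Node], Set[str]]:
    """Send a work unit to the free node that already holds most of `needed`.

    Returns the chosen node and the constant blocks that still have to be
    shipped to it, or (None, set()) if every node is busy.
    """
    free = [n for n in nodes if not n.busy]
    if not free:
        return None, set()
    # Prefer the largest overlap with the node's cache; break ties by name.
    best = min(free, key=lambda n: (-len(n.cached & needed), n.name))
    missing = needed - best.cached
    # Assume the master ships the missing blocks and the node keeps them.
    best.cached |= missing
    best.busy = True
    return best, missing


nodes = [Node("n1", {"A", "B"}), Node("n2", {"C"}), Node("n3")]
node, missing = dispatch(nodes, {"C", "D"})
print(node.name, sorted(missing))
```

A real version would also need an eviction policy once local memory or disk fills up, which this sketch ignores.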
> Hope this helps.
Yes, tremendously - much thanks.