updating the Linux kernel

j at cnb.uam.es
Mon Jun 12 11:23:01 PDT 2000


Kathy Haigh Hutchinson <K.Haigh-Hutchinson at Bradford.ac.uk> wrote:
> On Mon, 12 Jun 2000, David Lombard wrote:
> Task based parallelism could recover, my factorisation is essentially task
> based. But parallelism is more often about data parallelism. Preserving
> copies of the data every time something changes would be so much effort in
> storage and in time to do the storing, that benefits of distributing the
> data in the first place would be significantly reduced, or am I missing
> something?
> Kathy HH

I guess the answer is "it depends".

It depends on whether you have a copy of the data somewhere else and
whether you can do without analyzing some of the data. Two examples:

If the data is kept at a redundant site, you can always restart the
job. An example might be a central repository that hands data out to
workers and waits for answers, or a distributed data set split into
pieces replicated on different nodes. You might start with a configuration
	data = union of sets 0..n-1
	store set i at node[i] AND node[(i+1) % n] <AND node...>
then start your job allocating node[i] for set[i].
If node[i] fails, you can then look up which other nodes have a
copy and restart its part on one of those (a suboptimal situation,
since one node would do double work, but still feasible). You
would code it as:

    fd_set myset;

    FD_ZERO(&myset);
    for (i = 0; i < n; i++) {
	node[i] = allocate_one_node();
	send_data(node[i]);
	FD_SET(socket[i], &myset);
    }
    do {
	done = YEPP;
	/* select() overwrites myset with the descriptors that are
	   ready; real code would rebuild the set on every pass */
	select(maxfd + 1, &myset, NULL, NULL, &long_enough);
	for (i = 0; i < n; i++) {
	    if (!FD_ISSET(socket[i], &myset)) { /* this node hasn't finished yet */
		if (!isalive(node[i])) { /* failed to answer, e.g. UDP echo */
		    /* same three steps as the initial allocation */
		    node[i] = allocate_one_node();
		    send_data(node[i]);
		    FD_SET(socket[i], &myset);
		    done = NOT_YET;
		} else {
		    /* node is OK, assume it just takes longer */
		    done = NOT_YET;
		}
	    }
	}
    } while (done != YEPP);

YMMV. You would gain much more efficiency if you split the data into
n*n sets instead of n sets and replicated them such that if one node
fails, its starting data is already replicated among all the other
nodes. Then, instead of only one node taking up the work of the
deceased, each of the n-1 survivors would take a 1/(n-1)-th share of
the dead node's work, and the load would stay evenly split. E.g.,
simplified (a C sketch follows):

make n pieces
for i = 0 to n-1
	allocate piece i to node i
	divide piece i into n-1 sub-pieces
	for j = 0 to i-1
		allocate sub-piece j to node j
	for j = i to n-2
		allocate sub-piece j to node j+1
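
A minimal C sketch of that placement, assuming a hypothetical place()
helper that just records which node stores what:

    #include <stdio.h>

    /* Hypothetical bookkeeping: node `node` holds sub-piece `sub` of
       piece `piece` (sub < 0 means the whole primary piece). */
    static void place(int piece, int sub, int node)
    {
        if (sub < 0)
            printf("piece %d         -> node %d\n", piece, node);
        else
            printf("piece %d, sub %d -> node %d\n", piece, sub, node);
    }

    int main(void)
    {
        int n = 4;                      /* nodes == pieces */

        for (int i = 0; i < n; i++) {
            place(i, -1, i);            /* primary copy of piece i */
            /* its n-1 sub-pieces land on every node except i */
            for (int j = 0; j < i; j++)
                place(i, j, j);
            for (int j = i; j < n - 1; j++)
                place(i, j, j + 1);
        }
        return 0;
    }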

It could be more elegant, I know, and more complex schemes may be devised.

Then, on failure of node i, you know exactly which piece was its data
(items n*i to n*(i+1)-1, say) and that it is split into sub-pieces
spread evenly over all the other nodes, so there you go:
	run job/iter on n nodes with n pieces
	wait for answers
	if one node dies, collect the n-1 answers that did arrive
	start the job again on the n-1 surviving nodes with the n-1 sub-pieces
	wait for sub-answers
	consolidate the n-1 answers with the n-1 sub-answers
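
In C-ish terms, the driver might look like this; run_pieces(),
wait_answers(), run_subpieces() and consolidate() are hypothetical
stand-ins for the real cluster calls:

    #include <stdio.h>

    /* Hypothetical stand-ins for the real cluster calls. */
    static void run_pieces(int n)     { printf("run %d pieces on %d nodes\n", n, n); }
    static int  wait_answers(int n)   { printf("wait for %d answers\n", n); return -1; }
    static void run_subpieces(int d, int s) { printf("rerun piece %d as %d sub-pieces\n", d, s); }
    static void consolidate(void)     { printf("merge answers and sub-answers\n"); }

    int main(void)
    {
        int n = 4;

        run_pieces(n);                  /* job/iter: n pieces on n nodes */
        int dead = wait_answers(n);     /* returns the dead node, or -1  */
        if (dead >= 0) {
            /* node `dead` failed: its piece already exists as n-1
               sub-pieces on the survivors, so rerun only those */
            run_subpieces(dead, n - 1);
            wait_answers(n - 1);
        }
        consolidate();                  /* n answers, or n-1 + sub-answers */
        return 0;
    }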

And the per-node time would be 1/n of the total in the best case, or
about 1/n + 1/n^2 in the failure case (each survivor picks up roughly
a 1/(n-1)-th share of the dead node's 1/n): for n = 10, that is 10%
versus about 11%.

Then, for the next job/iter, you may check whether the node is back
and repeat the scheme, or, if it isn't, grow the working dataset of
each remaining node by the sub-piece copy it already holds.

The second example is when you may as well try to survive with a
partial answer, e.g. you are looking for an answer and it *might* be
close enough or even already found, as in a database search:

	set_timeout(long_enough, my_sig_handler);
	when an answer arrives {
		record it; if it was the last one, cancel the timeout
	}
	my_sig_handler() {
		if (someone_failed) {
			if (user_is_glad)	/* partial answer good enough */
				accept what we have
			else
				restart the missing parts
		}
	}

A signal handler might as well have been used in the first case, and
an asynchronous multiplexer (select() again) would do as well in the
second. It's up to you to decide which suits you best. A signal
handler may be combined with a setjmp/longjmp jump if needed.
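
For the record, a minimal, self-contained sketch of that last
combination (SIGALRM plus sigsetjmp/siglongjmp; pause() stands in for
the blocking wait on worker answers):

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static sigjmp_buf bail;

    static void my_sig_handler(int sig)
    {
        (void)sig;
        siglongjmp(bail, 1);            /* abandon the blocking wait */
    }

    int main(void)
    {
        signal(SIGALRM, my_sig_handler);

        if (sigsetjmp(bail, 1) == 0) {
            alarm(30);                  /* long_enough, in seconds */
            pause();                    /* stand-in for waiting on answers */
            alarm(0);
            puts("complete answer");
        } else {
            /* timed out: decide whether the partial answer will do */
            puts("partial answer; good enough?");
        }
        return 0;
    }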

The point is that there is always a way out. 



