updating the Linux kernel

j at cnb.uam.es
Mon Jun 12 11:23:01 PDT 2000



Kathy Haigh Hutchinson <K.Haigh-Hutchinson at Bradford.ac.uk> wrote:
> On Mon, 12 Jun 2000, David Lombard wrote:
> 
> Task based parallelism could recover, my factorisation is essentially task
> based. But parallelism is more often about data parallelism. Preserving
> copies of the data every time something changes would be so much effort in
> storage and in time to do the storing, that benefits of distributing the
> data in the first place would be significantly reduced, or am I missing
> something?
> 
> Kathy HH
> 

I guess the answer is "it depends".

It depends on whether you have a copy of the data somewhere else, and
whether you can afford to do without analyzing some of the data. Two examples:

If the data is kept at a redundant site, you can always restart the
job. An example might be a central repository that hands data to
workers and waits for answers, or a distributed data set split into
pieces replicated on different nodes. You might start with a configuration
like
	data = union of sets 1..n
	store set i at node i AND node[ (i+1) % n] <AND node...>
then start your job allocating node[i] for set[i].
If node[i] fails, you can then look up which other nodes have a
copy and restart its part on one of those (a suboptimal situation,
since one node would do double work, but still feasible). You
would then code it as:

    for (i = 1; i <= n; i++) {
	node[i] = allocate_one_node();
	send_data(node[i]);
	FD_SET(socket[i], &myset);
    }
    set_timeout(long_enough);
    do {
	done = YEPP;
	/* (in real code, re-arm myset before each select, since
	   select overwrites it with the ready descriptors) */
	select(maxfd + 1, &myset, NULL, NULL, &long_enough);
	for (i = 1; i <= n; i++) {
	    if (! FD_ISSET(socket[i], &myset)) { /* this node hasn't finished yet */
		if (! isalive(node[i])) { /* failed to answer, e.g. UDP echo */
		    node[i] = allocate_one_node();
		    send_data(node[i]);
		    FD_SET(socket[i], &myset);
		    done = NOT_YET;
		}
		else {
		    /* Node is OK, assume it just takes longer */
		    done = NOT_YET;
		}
	    }
	}
    } while (done != YEPP);

YMMV. You would gain much more efficiency if you split the data into
n*(n-1) sub-pieces instead of n sets and replicated it such that, if one
node fails, its data is already spread among all the other nodes. Then,
instead of only one node taking up the work of the deceased, each
surviving node takes 1/(n-1)-th of the dead node's work, and the load is
split evenly. E.g., simplified:

make n pieces
for i = 0 to n-1
	allocate piece i to node i
	divide piece i into n-1 sub-pieces
	for j = 0 to i-1
		allocate sub-piece j of piece i to node j
	for j = i to n-2
		allocate sub-piece j of piece i to node j+1

It could be more elegant, I know, and more complex schemes may be devised.

Then, on failure of node i, you know that its data is piece i, split into
n-1 sub-pieces spread evenly over all the other nodes, so,
there you go:
	run job/iter on n nodes with n pieces
	wait for answers
	if one node dies, collect the other n-1 answers
	restart the job on the n-1 surviving nodes with the dead node's n-1 sub-pieces
	wait for the sub-answers
	consolidate the n-1 answers with the n-1 sub-answers
	done.

And the time would be 1/n of the serial time in the best case, or roughly 1/n + 1/n^2 with one failure.

Then, for the next job/iter, you may check whether the node is back and
repeat the scheme; if it isn't, you increase the working dataset of each
remaining node with its copy of the dead node's sub-piece.


The second example is when you may as well try to survive with a partial
answer, e.g. when you are looking for an answer that *might* be close enough
or even already found, as in a database search:

	start_subjobs();
	set_timeout(long_enough, my_sig_handler);
	wait_for_answers
	when an answer arrives {
		save_it_for_later
	}
	consolidate_all_answers
	show_user
...
	my_sig_handler() {
		check_all_nodes_are_alive
		if someone_failed {
			consolidate_answers_to_date
			show_user
			if (user_is_glad)
				OK
			else
				die_miserably
		}
	}

A signal handler might as well have been used in the first case, just as
an asynchronous multiplexer would do in the second. It's up to you
to decide which suits you best. A signal handler may be combined with
a setjmp/longjmp jump if needed.

The point is that there is always a way out. 

				j






More information about the Beowulf mailing list