[Beowulf] Newbie questions on cluster technology to use

Wed Jul 25 16:58:17 PDT 2007

Hi all,

This is my first post, please accept my apologies if questions are too 
simple. I would appreciate pointers to documentation, writeups, howtos 
etc. I did some research before posting here, but got lost in sheer 
amount of information and competing technologies available.

We are planning to setup a cluster at my work place to handle some 
computation heavy jobs we have and the main task at the moment is to 
choose the right technology.

First of all, let me try to describe the task we have at hand.

1. There are a lot of relatively short jobs submitted by users. There 
are also much longer jobs submitted automatically at a known schedule.

2. Even though jobs are short (take minutes to complete on single 
machine) it is still important to parallelize each job to run them even 
faster (order of tens of seconds). That's financial industry we are 
talking about and time is money.

3. Jobs are quite easily parallelizable, probably embarrassingly so. 
Simple master/slave pattern naturally applies here. We already have 
parallel implementation running on a single host utilizing multiple 
processors via threads. It would be nice to be able to do it over many 
machines as well.

4. Jobs have to be scheduled properly, meaning that some users should 
have higher priority than others and especially than automated long 
running jobs, if user submits too many jobs his priority decreases, etc.

5. Implementation have to be fault tolerant, transparently surviving 
individual machine failures. Transparently for users that is, it is Ok 
to program tasks in a special way to get fault tolerance.

6. It would be nice to be able to submit "backup" tasks once job nears 
completion just in case some nodes in cluster are running slow. E.g. if 
job is split in 1000 tasks, runs on 16 node cluster and it is almost 
done, there are just 4 tasks to finish the job and there are a lot of 
idling nodes on cluster, scheduler could submit each of outstanding task 
to two machines and pick up results from whichever one completes first. 
If cluster is heterogeneous, or one node just runs slower it could 
speedup job completion considerably. At least that's what I've read in 
Google's mapreduce paper.

7. Some cluster health monitoring is needed. Does not have to be 
sophisticated, but at least we should be able to learn easily that some 
host has died and needs repairment. Statistics are nice to have as well 
to be able to adjust user priorities, make decisions on buying new 
hardware etc.

8. The business is somewhat Windows centric, though I would try to push 
Linux as a platform. It is doable, provided benefits are good. Linux 
port of the program is not a problem.

Potential solutions I see:

1. TIBCO distributed queue. In short it is a proprietary solution that 
more or less is a fault tolerant load balancing. The downside is absence 
of any scheduler (works as FIFO) and the fact that it is proprietary. We 
would much rather use open source technologies. See below for a bit of 
info on TIBCO.

2. MPI with some scheduler (Condor?). From what I read looks like fault 
tolerance is not easy to achieve in MPI world, and even if it is 
possible, then failure on a master node will render whole cluster 
unusable. I could be wrong on this, and I hope I am.

3. Torque? Grid Engine? Globus? Something else?

What are your suggestions? We need to decide on technology and try to 
implement it, gaining more knowledge in the process and hopefully making 
more informed decision in version 2 of our cluster. Any input would be 
greatly appreciated.

Some details on TIBCO. Tibco at heart is an enterprise messaging system, 
which propagates information via broadcasts on the same subnet and can 
route it from subnet to subnet via special daemon. It was mainly 
designed to integrate various systems in enterprise via common pipe 
where each system connects for data exchange, instead of building many 
point-to-point connections between individual systems. On top of this 
messaging technology they developed distributed queue, which works like 
this: you start many copies of app on many machines, they all find each 
other via broadcasts, elect a master among themselves, send heartbit 
messages every now and then to monitor healths of nodes, once message 
arrives, master chooses which worker should process it. If one of the 
nodes dies, master resubmits his task to other node. If master dies, 
remaining nodes elect new master and keep going from there. It is 
possible since all communication is done via broadcasts, and every node 
can maintain master's state.

Regards,

Kirill Lapshin