julien.leduc at lri.fr
Thu Jul 26 02:28:32 PDT 2007
>> I'm interested in utilising the hardware to create something akin to
>> the sun grid or the amazon elastic computing cloud whereby the
>> resources available to the environment are automatically expanded and
>> contracted. Maybe I have the wrong end of the stick on how these
>> services operate.
> no, I think you're right on, and there's not much to it. why do you
> think Sun or Amazon have any special magic? beowulf clusters running
> multi-user queueing systems are precisely such an "elastic", "compute-
> on-demand" thingy, just without paying for the isolation, because such
> clusters are mainly motivated by performance.
With a multi-user queueing system, you can have a cluster that behaves
like the Sun or Amazon offerings: you choose the nodes that fulfill the
user's requirements, fetch a VM image onto those nodes (during the
'prolog' section of the batch scheduler), start the VMs on the physical
nodes, and make sure the user can log on to them or fetch his data and
run a passive job. Once the job is finished, clean up by destroying the
VMs, and let another user reserve the nodes.
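
As a rough sketch of that select / prolog / epilog cycle: the code below
only simulates it in memory. Node, select_nodes and reservation are
hypothetical names made up for illustration, not the API of any real
batch scheduler, and "starting a VM" is reduced to recording a string.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical description of one physical node."""
    name: str
    cores: int
    mem_gb: int
    reserved_by: Optional[str] = None
    vms: List[str] = field(default_factory=list)


def select_nodes(nodes, count, min_cores, min_mem_gb):
    """Pick `count` free nodes that satisfy the user's requirements."""
    free = [n for n in nodes
            if n.reserved_by is None
            and n.cores >= min_cores
            and n.mem_gb >= min_mem_gb]
    if len(free) < count:
        raise RuntimeError("not enough matching free nodes")
    return free[:count]


@contextmanager
def reservation(nodes, user, image):
    """Prolog on entry (reserve, start VM); epilog on exit (destroy, release)."""
    for n in nodes:
        n.reserved_by = user
        n.vms.append(f"{image}@{n.name}")   # stands in for fetching/starting a VM
    try:
        yield nodes
    finally:
        for n in nodes:
            n.vms.clear()                   # stands in for destroying the VM
            n.reserved_by = None            # node can be reserved again


cluster = [Node("n1", 4, 8), Node("n2", 2, 4), Node("n3", 8, 16)]
chosen = select_nodes(cluster, count=2, min_cores=4, min_mem_gb=8)
with reservation(chosen, user="alice", image="debian-vm") as held:
    running = [vm for n in held for vm in n.vms]
```

The context manager is just one way to guarantee that the epilog runs
even if the user's job fails; a real scheduler does this with its own
prolog/epilog hooks.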

More isolation can be achieved for a user who needs to be root on the
node, run a modified version of the kernel, or run several VMs on top
of his environment. For that, you have to let him deploy his own
environment on the node.

This last technique ensures reproducible experiments and better
performance; the drawback is more work on the middleware that makes all
that magic come

Combining the two previous techniques could help users test their
OS + experiment program in a VM and then deploy it at larger scale
for a true run on all the cluster(s ;) ).

This is a very interesting approach (at least for computer scientists),
and the second approach gives quite good results for the moment. The
combination of the two techniques still has to be implemented; it would
free up more resources, since users could test their environments on
many virtual nodes while consuming fewer physical nodes.
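
To put a number on "fewer physical nodes", here is a small
back-of-the-envelope helper. The packing density (vms_per_node) is an
assumption you would have to measure for your own hardware and
workload; the function name is made up for illustration.

```python
import math


def physical_nodes_needed(virtual_nodes, vms_per_node):
    """Physical nodes required to host `virtual_nodes` test VMs."""
    if virtual_nodes < 0 or vms_per_node < 1:
        raise ValueError("need virtual_nodes >= 0 and vms_per_node >= 1")
    return math.ceil(virtual_nodes / vms_per_node)


# Testing a 64-node environment at an assumed 8 VMs per physical node:
test_run = physical_nodes_needed(64, 8)
# One more virtual node spills onto an extra physical node:
spill_run = physical_nodes_needed(65, 8)
```

The true run afterwards would still use one physical node per
environment, so the saving only applies to the test phase.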

The main problem is being able to control the nodes remotely, which
requires hardware that supports remote reboots, remote console
management...