[Beowulf] HPC fault tolerance using virtualization
Bogdan Costescu
Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Jun 16 03:27:18 PDT 2009
On Tue, 16 Jun 2009, John Hearns wrote:
> I believe that if we can get features like live migration of failing
> machines, plus specialized stripped-down virtual machines specific
> to job types then we will see virtualization becoming mainstream in
> HPC clustering.
You might be right, at least in the short term. In my experience,
several ISVs are very slow to adopt newer features related to system
infrastructure in their software - by system infrastructure I mean
anything to do with the OS (e.g. taking advantage of CPU/memory
affinity), the MPI library, the queueing system, etc. So even once the
MPI libraries gain fault-tolerance features, it will take a long time
before those features see real-world use.
By comparison, virtualization is something the ISVs can offload
entirely to the sysadmins or system integrators, because neither the
application nor the MPI library (which is sometimes linked into
the executable...) has to be aware of it. The ISVs can then even
choose which virtualization solution they "support".
Another aspect, which I have already mentioned some time ago, is that
an ISV can much more easily mandate the use of a particular OS and
environment, because these run inside the VM, independent of what runs
on the host. They can even provide a VM image which includes the OS,
environment and application, and declare that the only supported
configuration... This is already done for non-parallel applications;
only one step is needed for parallel ones: adapting the image to the
underlying network to get HPC-level performance. I don't think that
adapting to the queueing system is really necessary from inside the
VM; the queueing system can either start one VM per core, or start one
VM with enough virtual CPUs to fill the number of processing elements
(or slots) allocated to the job on the node - provided the VM is able
to adapt itself to such a situation, e.g. by starting several MPI
ranks and using shared memory for MPI communication.

Further, to stop the job cleanly, the queueing system will have to
stop the VMs, sending first a "shutdown" and then a "destroy" command,
much like sending SIGTERM followed by SIGKILL today.
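The shutdown-then-destroy sequence above could be sketched roughly as
follows. This is only an illustration of the pattern, not part of any
existing queueing system: the virsh commands, the grace period and the
poll interval are all assumptions on my part (any hypervisor with
shutdown/destroy semantics would do), and the command runner is
injectable so the logic can be exercised without libvirt installed.

```python
import subprocess
import time

def stop_vm(domain, grace_period=60, poll_interval=5,
            run=subprocess.call):
    """Ask a VM to stop cleanly, then force it if it ignores the request.

    'run' defaults to shelling out via subprocess.call; the virsh
    commands below are an assumption, analogous to SIGTERM/SIGKILL.
    """
    # Step 1: ask the guest OS to shut down cleanly (the "SIGTERM").
    run(["virsh", "shutdown", domain])

    # Step 2: poll until the domain disappears or the grace period ends.
    deadline = time.time() + grace_period
    while time.time() < deadline:
        if run(["virsh", "domstate", domain]) != 0:
            return  # domain is gone; clean shutdown succeeded
        time.sleep(poll_interval)

    # Step 3: pull the virtual plug (the "SIGKILL").
    run(["virsh", "destroy", domain])
```

A queueing system epilogue would call something like this once per VM
it started for the job, so a well-behaved guest gets a chance to flush
its state before being destroyed.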
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de