[Beowulf] HPC fault tolerance using virtualization

Ashley Pittman ashley at pittman.co.uk
Tue Jun 16 07:06:18 PDT 2009

On Tue, 2009-06-16 at 12:27 +0200, Bogdan Costescu wrote:
> You might be right, at least when talking about the short term. It has 
> been my experience with several ISVs that they are very slow in 
> adopting newer features related to system infrastructure in their 
> software - by system infrastructure I mean anything that has to do 
> with the OS (f.e. taking advantage of CPU/mem affinity), MPI lib, 
> queueing system, etc. So even if the MPI lib will gain features to 
> allow fault tolerance, it will take a long time until they will be in 
> real-world use.

This is true: ISVs like to statically link everything, lock things down
as much as possible and then rubber-stamp it as "supported".

> By comparison, virtualisation is something that the ISVs can 
> completely offload to the sysadmins or system integrators, because 
> neither the application nor the MPI lib (which is sometimes linked in 
> the executable...) will have to be aware of it. The ISVs can then even 
> choose what virtualization solution they "support".

So it's good for ISVs.  It's bad for the sysadmins, it's bad for the
system integrators and it's bad for the end users.

> Another aspect, which I have already mentioned some time ago, is that 
> the ISV can much easier force the usage of a particular OS and 
> environment, because this runs in the VM and is independent of what 
> runs on the host. They can even provide a VM image which includes the 
> OS, environment and application and declare this as the only supported 
> configuration...

This is frankly an insane way of doing things.  The only justification I
can find for it is that ISV code is flaky and breaks if things like, say,
a network driver change underneath it.  The correct answer is obviously
to write better quality software, a job made easier in the open-source
world, where it's a lot easier to re-compile code should there be an
underlying change in the OS.

> this is done already for non-parallel applications, 

I'll believe it; it's driven from the Windows world (as is much of the
virtualisation hype), where it really is only possible to run one service
per OS instance, so any complex set of software requires N underutilised
computers to function properly.  What virtualisation does is allow you
to run these N underutilised computers on one single computer.

> but there's only one step needed for parallel ones: adapting it to the 
> underlying network to get the HPC level of performance. I think that 
> adapting to the queueing system is not really necessary from inside 
> the VM; the queueing system can either start one VM per core or start 
> one VM with several virtual CPUs to fill the number of processing 
> elements (or slots) allocated for the job on the node - if the VM is 
> able to adapt itself to such a situation, f.e. by starting several MPI 
> ranks and using shared memory for MPI communication. Further, to 
> cleanly stop the job, the queueing system will have to stop the VMs, 
> sending first a "shutdown" and then a "destroy" command, similar to 
> sending SIGTERM and SIGKILL today.
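
For reference, the SIGTERM-then-SIGKILL escalation mentioned above can be
sketched as follows.  This is a minimal illustration for an ordinary
POSIX process, not code from any queueing system or hypervisor; the
function name stop_job and the grace period are hypothetical.

```python
import signal
import subprocess
import sys


def stop_job(proc, grace_seconds=5.0):
    """Ask a process to stop politely, then force it if it does not comply.

    Hypothetical helper: returns "terminated" if SIGTERM was enough,
    "killed" if SIGKILL had to be used.
    """
    proc.send_signal(signal.SIGTERM)      # the polite "shutdown" request
    try:
        proc.wait(timeout=grace_seconds)  # give the job time to clean up
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL, the "destroy" equivalent
        proc.wait()                       # reap the child
        return "killed"


if __name__ == "__main__":
    # Stand-in for a job: a child that just sleeps.
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_job(job, grace_seconds=2.0))
```

A VM-based equivalent would follow the same pattern, with the two signals
replaced by whatever "shutdown" and "destroy" commands the chosen
virtualisation layer provides.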

So HPC as an industry can invest serious amounts of effort and time
converting cluster software into a model where the "application" is
really a closed-box virtual image that we simply "start" a number of
instances of and then wait for them to shut down by themselves.  This is
superficially great for ISVs, as it's just like running in the cloud and
it "only" costs 5% in performance (which incidentally I don't believe
for a minute).  It's a long way from "High Performance" however; perhaps
we should coin the phrase "Medium Performance Computing" for this model?

Maybe, just maybe, for most people it's good enough.  At moderate scale,
up to a couple of hundred cores, this is acceptable: cores are cheap
enough (by the hour) that simply paying for the 20% more cores to recoup
your 5% performance loss is not an issue.  Scaling a little bit further
isn't beyond the capability of many applications, and if you've
simplified things elsewhere then the savings will pay for those extra
cores anyway.
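
A rough sketch of that arithmetic, under assumptions that are not
measurements: the application follows Amdahl's law, and virtualisation
slows everything down uniformly by a fixed fraction.  The function name
and the 99% parallel fraction are purely illustrative.

```python
def cores_to_recoup(cores, parallel_fraction, overhead):
    """Cores a virtualised run needs to match a native run on `cores` cores.

    Assumes Amdahl's law and a uniform slowdown of `overhead` (0.05 = 5%).
    Returns None when no number of cores can make up the difference.
    """
    serial = 1.0 - parallel_fraction
    # Native run time, normalised so that a single core takes 1.0.
    native_time = serial + parallel_fraction / cores
    # Virtualised time on n cores is
    #   (serial + parallel_fraction / n) / (1 - overhead);
    # setting that equal to native_time and solving for n gives:
    budget = (1.0 - overhead) * native_time - serial
    if budget <= 0.0:
        return None  # the serial part alone already exceeds the time budget
    return parallel_fraction / budget


if __name__ == "__main__":
    for cores in (200, 2000):
        needed = cores_to_recoup(cores, parallel_fraction=0.99, overhead=0.05)
        if needed is None:
            print(f"{cores} cores: cannot be recouped by adding cores")
        else:
            print(f"{cores} cores: about {needed / cores - 1.0:.0%} more cores needed")
```

Under these assumed numbers, a 99% parallel code at 200 cores needs
somewhat under 20% more cores to get back a 5% loss, whereas at 2000
cores no amount of extra hardware recovers it.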

The real losers here, however, are those people doing things at a scale
where you can't just chuck hardware at the problem and where you do
actually care about underlying performance: the traditional HPC crowd
who, let's be honest, are the ones with the money and the talent anyway.

It's as though HPC has gone mainstream, or is infiltrating it, whilst at
the same time mainstream computing is jumping into the cloud.  All of a
sudden HPC doesn't fit into the mainstream model any more (not that it
ever really did IMHO) and all the recent converts are sat there in a
cloud of hot air looking back at us in bewilderment.



Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
