[Beowulf] HPC fault tolerance using virtualization
Greg at keller.net
Tue Jun 16 09:14:04 PDT 2009
> Date: Tue, 16 Jun 2009 10:38:55 +0200
> From: Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>
> On Monday 15 June 2009 20:47:40 Michael Di Domenico wrote:
>> It would be nice to be able to just move bad hardware out from under a
>> running job without affecting the run of the job.
> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and
> the OS would only notice a bad disk while trying to write on it, likely
> asked to do so by the job), or bad memory, or bad CPU, or faulty PSU.
> Anything hardware losing bits mainly manifests itself in software errors.
> There is very little chance to spot a bad DIMM until something (like a job)
> tries to write to it.
We have recently purchased "un-blade" systems that may fit into the
missing list. These are systems where multiple nodes are hard-wired
into a single chassis, and in order to work on one, all of them have to
come offline. The power efficiency and system costs are compelling,
but the complexity of maintenance is a trade-off we decided to try.
If the Virtualization tax were low enough it would be useful, and would
give us more incentive to use these more cost/power-efficient options
without creating huge maintenance hassles.
> So unless there's a way to detect faulty hardware before it affects
> software, it's very likely that the job would have crashed already, before
> the OS could pull out its migration toolkit.
If the job is running against a large networked file system, but the
local *Real* OS is depending on the failing disk, the job could be
migrated off when the OS starts detecting SCSI or network (IB?)
errors. The same is true for some network issues. Of course, who in
their right mind would want an OS dependent on a local disk these days?
Note: this is a shameless plug for Perceus and all other such options
that leave spinning disk for scratch/checkpointing or some other
lower-risk purpose... if any.
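
As a rough illustration of the "migrate when the OS starts seeing errors"
idea above, here is a minimal Python sketch. It assumes the trigger is
growth in kernel error counters exposed through sysfs. The paths listed
exist on typical Linux systems (network interface statistics and EDAC
correctable-error counts), but the choice of counters, the threshold, and
the function names read_counters and should_migrate are hypothetical, and
the actual migration step is left out entirely.

```python
import glob

# Counter files to watch. Which ones matter on a given cluster is an
# assumption; adjust the patterns for the hardware you actually have.
DEFAULT_PATTERNS = (
    "/sys/class/net/*/statistics/rx_errors",
    "/sys/class/net/*/statistics/tx_errors",
    "/sys/devices/system/edac/mc/mc*/ce_count",  # correctable memory errors
)


def read_counters(patterns=DEFAULT_PATTERNS):
    """Return {path: value} for every readable counter file matching patterns."""
    counters = {}
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                with open(path) as f:
                    counters[path] = int(f.read().strip())
            except (OSError, ValueError):
                # Skip counters that vanish or hold non-numeric data.
                continue
    return counters


def should_migrate(baseline, current, threshold=10):
    """True if any counter grew by at least `threshold` since the baseline.

    Counters missing from the baseline are treated as unchanged, so a
    newly appearing device does not trigger a migration by itself.
    """
    for path, value in current.items():
        if value - baseline.get(path, value) >= threshold:
            return True
    return False
```

A monitoring loop would take one read_counters() snapshot at job start,
take another every so often, and hand the node to whatever migration or
drain mechanism is available once should_migrate() returns True.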
> The paper John mentioned is centered around IPMI for preventive fault
> detection. It probably works for some cases (where you can use things like
> temperature probes or fan speeds), where IPMI detects hardware faults
> before they affect the running job. But from what I've seen most often, it's
> kind of too late, and IPMI logs an error when the job has crashed already.
> And even if it didn't crash yet, what kind of assurance do you have that
> the result of the simulation has not been corrupted in some way by that
> faulty DIMM you got?
Single-bit errors likely won't corrupt the system, but it would be
nice to handle them when they pop up, rather than waiting for
maintenance windows or offlining the node and waiting for any jobs to
drain off of it. This would be a win for an admin, who could do maintenance
on their own schedule and minimize the actual lost compute time of
> My take on this is that it's probably more efficient to develop
> features and recovery in software (like MPI) rather than adding a
> virtualization layer, which is likely to decrease performance.
I agree. I was very excited about "Evergrid"'s (Now Librato?) notion
of universal checkpointing... but I've never been able to get any time
from/with them. This seems like an approach for checkpointing that
would work out very cleanly for many apps that are clueless on the
notion of checkpointing.
Moral of the story:
There was a day when the OS was a huge consumer of a workstation's
resources (CPU/memory/disk) and as such a huge Tax. Today it's a
small fraction of the footprint, and so we worry less and less about
it and its efficiency, except where it impacts the performance/stability
of the apps that depend on it. My guess is that
Virtualization is just an extension of that trend and will eventually
be the way we need to go as the Tax of the OS / VM layers becomes more
Given that general trend, I am happy to see smart people who prefer
graceful code and efficiency trying to steer these VM options toward
low overhead solutions where I can firewall a bad code from leaving
machines in a bad state for the next code that tries to run there.
Should the applications be better stewards of the environment they run
in... yes.
Should the OS protect itself better from bad codes... yes.
Should the admin configure the OS better so that codes can't do bad
things... yes.
But I can't control those. I can control my OS, and give the apps their
own OS via VMs if the Tax is low enough. Anything different would be
like saying we shouldn't need firewalls because the apps listening to
the ports shouldn't be hackable. It's true, but not something I want
to try and control.