[Beowulf] Re: HPC fault tolerance using virtualization

Dave Love d.love at liverpool.ac.uk
Tue Jun 16 03:01:42 PDT 2009


John Hearns <hearnsj at googlemail.com> writes:

> I was doing a search on ganglia + ipmi (I'm looking at doing such a
> thing for temperature measurement)

Like
<URL:http://www.nw-grid.ac.uk/LivScripts?action=AttachFile&do=get&target=freeipmi-gmetric-temp>?
If you want to take action, though, go direct to Nagios or similar with
sensor readings, chassis health data, etc.
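
For anyone who'd rather not fetch that, the whole idea fits in a few
lines of Python: shell out to freeipmi's ipmi-sensors for the readings
and hand them to gmetric.  This is a sketch of the same idea, not the
script above; the ipmi-sensors flags and column layout differ between
freeipmi versions, so treat the parsing as an assumption to adapt.

#!/usr/bin/env python
# Sketch: push IPMI temperature readings into Ganglia via gmetric.
# Assumes freeipmi's ipmi-sensors and Ganglia's gmetric are on PATH;
# flag names and column order vary across freeipmi versions.
import subprocess

def read_temperatures():
    """Yield (sensor_name, celsius) pairs parsed from ipmi-sensors."""
    out = subprocess.check_output(
        ["ipmi-sensors", "--sensor-types=Temperature",
         "--no-header-output", "--comma-separated-output"])
    for line in out.decode().splitlines():
        # Assumed field order: ID,Name,Type,Reading,Units,Event
        fields = [f.strip() for f in line.split(",")]
        if len(fields) >= 4 and fields[3] not in ("", "N/A"):
            yield fields[1], float(fields[3])

for name, temp in read_temperatures():
    # One metric per sensor; names must be unique within the host.
    subprocess.check_call(
        ["gmetric", "--name", "ipmi_temp_%s" % name.replace(" ", "_"),
         "--value", str(temp), "--type", "float", "--units", "Celsius"])

Drop something like that into cron or a gmond plugin and the readings
turn up in the usual Ganglia graphs.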

> It's something I've wanted to see working - doing a Xen live migration
> of a 'dodgy' compute node, and the job just keeps on trucking.
> Looks as if these guys have it working. Anyone else seen similar?

I don't understand what's wrong with using MPI fault tolerance.  I
recall testing LAM+BLCR and having processes migrate when SGE host
queues were suspended, but I'm not in a position to try the Open MPI
version.  Nothing short of checkpointing will help, anyway, when a node
just dies, and that's the failure we see most often (e.g. because we
were sold a shambolic Barcelona system with flaky hardware and an OS
that doesn't support quad-core properly).
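
Even without library support, application-level checkpointing to shared
storage covers the node-death case: kill the job, resubmit, resume.  A
minimal sketch with mpi4py, where the path, interval, and pickled state
are illustrative assumptions rather than any particular code:

# Sketch: application-level checkpointing with mpi4py.  Each rank
# periodically dumps its state to shared storage, so a job killed by a
# dead node can simply be resubmitted and resume from the last dump.
import os
import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ckpt = "/shared/ckpt/rank%04d.pkl" % rank   # assumed shared filesystem

# Resume from the last checkpoint if one exists, else start fresh.
if os.path.exists(ckpt):
    with open(ckpt, "rb") as f:
        step, state = pickle.load(f)
else:
    step, state = 0, 0.0

while step < 1000:
    state += comm.allreduce(rank, op=MPI.SUM)   # stand-in for real work
    step += 1
    if step % 100 == 0:
        # Write to a temp file and rename, so a crash mid-write
        # can't leave a corrupt checkpoint behind.
        with open(ckpt + ".tmp", "wb") as f:
            pickle.dump((step, state), f)
        os.rename(ckpt + ".tmp", ckpt)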

How does Xen perform generally, anyhow?  Are there useful data on the
HPC performance impact of Xen and/or KVM on, say, Ethernet-connected
NUMA systems?  I've only seen such figures for non-NUMA InfiniBand
systems.
