[Beowulf] clustering using xen virtualized machines

Tue Jan 26 15:18:40 PST 2010

>> Is it just me, or does HPC clustering and virtualization fall on
>> opposite ends of the spectrum?

depends on your definitions.  virtualization certainly conflicts with
those aspects of HPC which require bare-metal performance.  even if you
can reduce the overhead of virtualization, the question is why?  look
at the basic sort of HPC environment: compute nodes running a single 
distro, controlled by a scheduler.  from the user's or job's perspective,
there are just some nodes - which ones doesn't matter, or even how many 
in total.  the user _should_ be able to assume that when they land on a node,
it behaves as if freshly installed and booted de novo.  we don't reboot
nodes between jobs, of course, or even make much effort towards
preventing a serial job from noticing other serial jobs on the same 
node (as containers would, let alone VMs).  but we could, without tons 
of effort; it would just lower utilization.

virtualization is about a few things:
 	- improve utilization by coalescing low-duty-cycle services.
 	- isolate services from each other - either to directly arbitrate
 	runtime resource contention, or to disentangle configurations.
 	- encapsulate all the state of a server so it can be moved.

I think the first axis is quite non-HPC, since I don't think of HPC jobs
as being like idle services.  (OTOH, many clusters have good utilization
because multiple workloads get interleaved _above_ the processor level.)
the second factor is not often an HPC problem, at least not in my experience,
where J Random Fortran user doesn't really care that much about the
environment (ie, they want f77 and lapack and empty queues).  migration 
has some HPC appeal, since it permits defragmenting a cluster, 
as well as better preemption.
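
to make the defragmentation point concrete, here is a toy sketch in Python.
everything in it is an assumption for illustration: a node is modelled as a
list of per-job core counts, "migration" is modelled as free repacking with
first-fit decreasing, and the names (defragment, free_whole_nodes) are
hypothetical.  it ignores memory, interconnect topology and the cost of the
migrations themselves.

```python
def free_whole_nodes(placement):
    """Count nodes that currently have no jobs on them."""
    return sum(1 for jobs in placement if not jobs)


def defragment(placement, cores_per_node):
    """Repack jobs onto as few nodes as first-fit decreasing manages.

    placement is a list of nodes; each node is a list of job core counts.
    Returns a new placement with the same number of nodes. If the heuristic
    cannot place every job, the original placement is returned unchanged.
    """
    # Largest jobs first, so the small ones fill the remaining gaps.
    jobs = sorted((j for node in placement for j in node), reverse=True)
    packed = [[] for _ in placement]
    for job in jobs:
        for node in packed:
            if sum(node) + job <= cores_per_node:
                node.append(job)
                break
        else:
            # Heuristic failed; leave the cluster as it was.
            return [list(node) for node in placement]
    return packed


# Four 4-core nodes, every one of them partly occupied by serial/small jobs.
before = [[2], [1], [1], [4]]
after = defragment(before, cores_per_node=4)
print(free_whole_nodes(before), "->", free_whole_nodes(after))
```

in this example no node is entirely free before repacking, so a job that
needs a whole node has to wait; after repacking, two nodes are empty.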

> Gavin, not necessarily. You could have a cluster of HPC compute nodes
> running a minimal base OS.
> Then install specific virtual machines with different OS/software stacks
> each time you run a job.

or for each job, just install the provided OS image on the bare metal...
your job's done, have it halt or reboot the node ;)

> OK, this is probably more relevant for grid or cloud computing - I first

grid and cloud computing are all part of the same game, no?  along with 
massively parallel low-latency MPI, old-style vector supercomputing, 
GPU-assisted computing, throughput serial farming, etc.

> thought this would be a good idea when seeing
> that (at the time) the CERN LHC Grid software would only run with Redhat
> 7.2
> So you could imagine 'packaging up' a virtual machine which has your
> particular OS flavour/libraries/compilers and shipping
> it out with the job.

right, that's one of the axes of the problem-space: whether the app gets its 
own custom runtime environment (in the sense of kernel, libc, etc).  another
axis is the degree to which the app has to contend for resources (as in an 
overcommitted normal cluster, or a VM without guaranteed resources.)

> Another reason could be fault tolerance - you run VMs on the compute
> nodes. When you detect a hardware fault is coming along
> (eg from ECC errors or disk errors) you perform a live migration from
> one node to another - and your job keeps on trucking.
> (In theory, checkpointing needed etc. etc.)

I'm pretty skeptical about this - the main issue with checkpointing is 
when there are external side-effects.  checkpointing networked apps
(including MPI) is hard because you have state "in flight", so can only
freeze-dry the state by quiescing (letting the messages land, etc).
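
a toy model of the "in flight" problem, in Python.  this is not how any
real MPI or checkpoint library works; ToyCluster and its methods are
hypothetical names, and the "network" is just a queue.  the only point is
that a snapshot of per-rank state taken while messages are still in the
network misses them, while a snapshot taken after quiescing does not.

```python
import copy
from collections import deque


class ToyCluster:
    def __init__(self, nranks):
        # Per-rank state: the payloads each rank has received so far.
        self.received = [[] for _ in range(nranks)]
        # Messages sent but not yet delivered: (dest, payload) pairs.
        self.in_flight = deque()

    def send(self, dest, payload):
        self.in_flight.append((dest, payload))

    def deliver_one(self):
        dest, payload = self.in_flight.popleft()
        self.received[dest].append(payload)

    def quiesce(self):
        """Let every outstanding message land before checkpointing."""
        while self.in_flight:
            self.deliver_one()

    def snapshot(self):
        """Capture only per-rank state, as a per-node checkpoint would."""
        return copy.deepcopy(self.received)


cluster = ToyCluster(nranks=2)
cluster.send(1, "a")
cluster.send(0, "b")

naive = cluster.snapshot()   # taken with two messages still in flight
cluster.quiesce()
clean = cluster.snapshot()   # taken after the network has drained

print("naive:", naive)
print("clean:", clean)
```

restoring from the naive snapshot would silently drop both messages, which
is exactly the case an app has to tolerate (or retry) for live migration
to look painless.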

the "live migration" demos I've seen have been apps that are tolerant 
to the loss of in-flight transactions (or which retry automatically).

so I don't think virt is any kind of paradigm-changer, 
just like manycore merely stretches existing definitions.

-mark