[Beowulf] HPC fault tolerance using virtualization
John Hearns
hearnsj at googlemail.com
Mon Jun 15 10:59:37 PDT 2009
I was doing a search on ganglia + ipmi (I'm looking at doing such a
thing for temperature measurement) when I came across this paper:
http://www.csm.ornl.gov/~engelman/publications/nagarajan07proactive.ppt.pdf
Proactive Fault Tolerance for HPC using Xen virtualization
It's something I've wanted to see working - doing a Xen live migration
of a 'dodgy' compute node, and the job just keeps on trucking.
Looks as if these guys have it working. Anyone else seen similar?
John Hearns