[Beowulf] HPC fault tolerance using virtualization

Tue Jun 16 02:02:11 PDT 2009

2009/6/16 Kilian CAVALOTTI <kilian.cavalotti.work at gmail.com>

>
>
> I may be missing something major here, but if there's bad hardware, chances
> are the job has already failed from it, right? Would it be a bad disk (and the
> OS would only notice a bad disk while trying to write on it, likely asked to
> do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything hardware
> losing bits mainly manifests itself in software errors. There is very little
> chance to spot a bad DIMM until something (like a job) tries to write to it.

What you say is very true.

However, you could look for correctable ECC errors, and for disks run a
smartctl test to see whether a disk is showing symptoms which might make it
fail in future.
Or maybe look at the error rates on your Ethernet or InfiniBand interface -
you might want to take that node out till it can be investigated (read:
reseating the cable!)
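
As a rough illustration of that kind of check, here is a minimal sketch in
Python. It assumes a Linux node with the EDAC driver loaded, so that
correctable-error counts appear under /sys/devices/system/edac/mc/, and it
reads the generic per-interface counters under /sys/class/net/. The function
names, the thresholds and the "suspect" rule are all made up for the example;
a real site would pick its own limits and would probably compare against the
previous reading rather than the raw totals.

```python
import glob
import os


def read_int(path):
    """Return the integer stored in a sysfs-style file, or 0 if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return 0


def correctable_ecc_errors(sysfs_root="/sys"):
    """Sum the EDAC correctable-error counters over all memory controllers."""
    pattern = os.path.join(sysfs_root, "devices/system/edac/mc/mc*/ce_count")
    return sum(read_int(p) for p in glob.glob(pattern))


def interface_errors(iface, sysfs_root="/sys"):
    """Sum the receive and transmit error counters of one network interface."""
    stats = os.path.join(sysfs_root, "class/net", iface, "statistics")
    rx = read_int(os.path.join(stats, "rx_errors"))
    tx = read_int(os.path.join(stats, "tx_errors"))
    return rx + tx


def node_looks_suspect(iface, max_ecc=0, max_net=0, sysfs_root="/sys"):
    """Hypothetical policy: flag the node if either counter exceeds its limit.

    max_ecc and max_net are illustrative thresholds, not recommended values.
    """
    ecc = correctable_ecc_errors(sysfs_root)
    net = interface_errors(iface, sysfs_root)
    return ecc > max_ecc or net > max_net
```

A scheduler prologue could call node_looks_suspect("eth0") and drain the node
when it returns True. Disks are not covered here; for those, something like
"smartctl -H /dev/sda" gives the drive's own health verdict. InfiniBand port
counters live in a different part of sysfs and would need their own reader.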