[Beowulf] Re: HPC fault tolerance using virtualization

Dave Love d.love at liverpool.ac.uk
Mon Jun 29 05:33:37 PDT 2009

Greg Lindahl <lindahl at pbm.com> writes:

>> What I typically see from smartd is alerts when one or more sectors have
>> already gone bad, although that tends not to be something that will
>> clobber the running job.  How should it be configured to do better
>> (without noise)?
> That isn't noise, that's signal.

Of course I didn't mean that bad block alerts were noise.  However,
there is what I and a hardware expert think is noise from the default
smartd configuration.  I'm interested in how best to configure it for
useful warnings.  I did have a look OTW, of course.

> You're just lucky that your running
> job doesn't need the data off the bad sector.

Not if the problem is, say, on /usr, which the job normally isn't going
to need before it finishes.

> You can try waiting
> until the job finishes before taking the node out of service; from the
> sounds of it, you will usually win. But if you don't have
> application-level end-to-end checksums of your data, how do you know
> if you won or not?

I know where the job is doing i/o, and I'm not going to kill multi-day,
multi-node jobs -- especially not automatically -- because there's a bad
sector somewhere irrelevant.  Also we have better things to worry about
here, at least, than application checksums, much as they might feature
in an ideal world.
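
For reference, the kind of application-level end-to-end checksum under
discussion might look roughly like the sketch below. It is only an
illustration: the helper names and the ".sha256" sidecar-file convention
are assumptions made up for the example, not something any particular
application or site uses.

```python
import hashlib


def write_with_checksum(path, data):
    """Write bytes to `path` and record their SHA-256 digest in a sidecar file.

    The "<path>.sha256" sidecar naming is a hypothetical convention.
    """
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(digest)
    return digest


def verify_checksum(path):
    """Re-read `path` and compare its digest with the recorded one."""
    with open(path, "rb") as f:
        actual = hashlib.sha256(f.read()).hexdigest()
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return actual == expected
```

A job would call write_with_checksum() when it produces output and
verify_checksum() before trusting that output later. This only detects
corruption of data the application itself wrote; it says nothing about
a bad sector under /usr.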
