<br><br>
<div class="gmail_quote">2009/6/16 Kilian CAVALOTTI <span dir="ltr"><<a href="mailto:kilian.cavalotti.work@gmail.com">kilian.cavalotti.work@gmail.com</a>></span><br>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">
<div class="im"><br> </div>I may be missing something major here, but if there's bad hardware, chances<br>are the job has already failed from it, right? Would it be a bad disk (and the<br>OS would only notice a bad disk while trying to write on it, likely asked to<br>
do so by the job), or bad memory, or bad CPU, or faulty PSU. Anything hardware<br>losing bits mainly manifests itself in software errors. There is very little<br>chance to spot a bad DIMM until something (like a job) tries to write to it.</blockquote>
<div> </div>
<div>What you say is very true.</div>
<div> </div>
<div>However, you could look for correctable ECC errors, and for disks run a smartctl self-test to see whether a disk is showing symptoms which might make it fail in the future.</div>
<div>Or maybe look at the error rates on your Ethernet or InfiniBand interface - you might want to take that node out of service until it can be investigated (read: reseating the cable!).</div>
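<div>Something like the sketch below, run from a cron job or health-check hook. Device names, sysfs paths, and interface names here are assumptions - adjust for your own distro and hardware, and note the EDAC counters only appear once the edac modules are loaded:</div>

```shell
#!/bin/sh
# Hedged sketch of a node health sweep: ECC counters, SMART, NIC errors.
# Paths/devices (mc*, /dev/sda, eth0) are examples, not universal.

# Correctable ECC error counts via the EDAC sysfs interface
for f in /sys/devices/system/edac/mc/mc*/ce_count; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done

# Kick off a short SMART self-test and report overall health
if command -v smartctl >/dev/null 2>&1; then
    smartctl -t short /dev/sda
    smartctl -H /dev/sda
fi

# Ethernet error counters; for InfiniBand ports, perfquery -e is the analogue
if command -v ethtool >/dev/null 2>&1; then
    ethtool -S eth0 | grep -i err
fi

SWEEP_DONE=1
echo "health sweep done"
```

<div>Any non-zero counter trending upward is a reason to drain the node from the scheduler before a job finds the fault for you.</div>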
<div><br> </div></div>