[Beowulf] GPFS and failed metadata NSD

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Mon May 22 16:01:24 PDT 2017






On 5/22/17, 4:30 AM, "Beowulf on behalf of Michael Di Domenico"
<beowulf-bounces at beowulf.org on behalf of mdidomenico4 at gmail.com> wrote:

>On Sun, May 21, 2017 at 9:37 PM, Christopher Samuel
><samuel at unimelb.edu.au> wrote:
>> On 01/05/17 21:40, John Hearns wrote:
>>
>>> Also remember that pairs of disks probably came off the production line
>>> at similar times. So this is probably a twins paradox!
>>
>> At my previous HPC gig we lost 2 drives in a RAID-5 array within a few
>> minutes of each other. They were manufactured on the same day.
>
>all of my contracts have a "part-randomization" clause in them to
>ensure vendors randomize the batches they pull parts from to build
>the machines.  hopefully everyone else's does too... :)



How do you verify that they really are from different batches (in a
significant way)?  Assembled on different days? Or what?  I can easily see
a mfr buying a week's worth or a month's worth of parts in one lot - and
those will all have essentially the same characteristics.  Date codes on
IC parts are typically just year and week, and relating that back to an
actual production lot run is non-trivial.

In the space business, we do a lot of lot tracking and so forth, and
a) it isn't cheap
b) it isn't always available
c) it isn't necessarily meaningful

That way when you get the alert that a bad batch of 2N2222 transistors has
been found, you can check the as-built docs for your spacecraft on the way
to Europa and breathe a sigh of relief that you didn't use one of "those
units".

And we could get into a discussion of what actually fails:  Is it some
component? Active or passive? Is it a component failure or a latent
manufacturing defect (a cracked solder joint that finally lets go or
something like that) or a damage thing (that bolt of lightning that struck
next to the UPS truck carrying your lot of identical units)?

I suspect the manufacturers have fairly good tracking on failure
statistics, but they probably don't do actual failure analysis on failed
units - unless there's a spike in the failure rates.  I wonder if the
quoted MTBFs on drives are "by handbook" or "by test", and of course, if
it's the latter, it's probably an accelerated test (at high temperatures),
and that gets into the assumptions about temperature effects on failure
rates (yeah, the rule of thumb is that every 10 degrees C hotter halves
the life, but that's not always the exponent.. there's fine art in
choosing the appropriate Activation Energy).
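
To put rough numbers on the activation energy point, here is a minimal
sketch of the usual Arrhenius acceleration-factor arithmetic.  The function
names and the temperatures and Ea values are mine, picked for illustration,
not anything from a drive vendor's test plan.  It shows what Ea the
"10 degrees halves the life" rule quietly assumes, and how much the
acceleration factor moves if the real Ea is something else.

```python
import math

# Boltzmann constant in eV/K
BOLTZMANN_EV_PER_K = 8.617333262e-5


def acceleration_factor(ea_ev, use_c, stress_c):
    """Arrhenius acceleration factor between a use and a stress temperature.

    ea_ev    -- assumed activation energy in eV (illustrative input)
    use_c    -- field operating temperature, degrees C
    stress_c -- accelerated-test temperature, degrees C
    """
    t_use = use_c + 273.15
    t_stress = stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress))


def implied_activation_energy(use_c, delta_c=10.0, factor=2.0):
    """Activation energy (eV) for which a delta_c rise changes life by `factor`."""
    t_use = use_c + 273.15
    t_hot = t_use + delta_c
    return BOLTZMANN_EV_PER_K * math.log(factor) / (1.0 / t_use - 1.0 / t_hot)


if __name__ == "__main__":
    # What Ea does "10 C hotter = half the life" correspond to near 25 C?
    print("implied Ea near 25 C: %.2f eV" % implied_activation_energy(25.0))

    # Same 40 C -> 85 C test, different assumed activation energies.
    for ea in (0.3, 0.55, 0.7, 1.0):
        print("Ea = %.2f eV -> AF = %.1f" % (ea, acceleration_factor(ea, 40.0, 85.0)))
```

Near room temperature the rule of thumb works out to an Ea of roughly
0.55 eV.  For the same 40 C to 85 C test, assuming 0.3 eV gives an
acceleration factor of about 4, while 1.0 eV gives about 100, so the
extrapolated MTBF swings by more than an order of magnitude on that one
choice.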





