[Beowulf] GPFS and failed metadata NSD

Sat Apr 29 11:12:32 PDT 2017

Thanks for the suggestions, but when this Phoenix rises from the ashes it
will be running BeeGFS over ZFS. The more I learn about GPFS the more I am
reminded of quote seen recently on twitter:

"People bred, selected, and compensated to find complicated solutions do
not have an incentive to implement simplified ones." -- @nntaleb
<https://twitter.com/nntaleb>

You can only read "you should contact support" so many times in
documentation and forum posts before you remember "oh yeah, IBM is a
_services_ company."

jbh

On Sat, Apr 29, 2017 at 8:58 PM Evan Burness <
evan.burness at cyclecomputing.com> wrote:

> Hi John,
>
> Yeah, I think the best word here is "ouch" unfortunately. I asked a few of
> my GPFS-savvy colleagues and they all agreed there aren't many good options
> here.
>
> The one "suggestion" (I promise, no Monday morning quarterbacking) I and
> my storage admins friends can offer, if you have the ability to do so (both
> from a technical but also from a procurement/policy change standpoint) is
> to swap out spinning drives for NVMe ones for your metadata servers. Yes,
> you'll still take the write performance hit from replication relative to a
> non-replicated state, but modern NAND and NVMe drives are so fast and low
> latency that it will still be as fast or faster than the replicated,
> spinning disk approach it sounds like (please forgive me if I'm
> misunderstanding this piece).
>
> We took this very approach on a 10+ petabyte DDN SFA14k running GPFS 4.2.1
> that was designed to house research and clinical data for a large US
> hospital. They had 600+ million files b/t 0-10 MB, so we had high-end
> requirements for both metadata performance AND reliability. Like you, we
> tagged 4 GPFS NSD's with metadata duty and gave each a 1.6 TB Intel P3608
> NVMe disk, and the performance was still exceptionally good even with
> replication because these modern drives are such fire-breathing IOPS
> monsters. If you don't have as much data as this scenario, you could
> definitely get away with 400 or 800 GB versions and save yourself a fair
> amount of $$.
>
> Also, if you're looking to experiment with whether a replicated approach
> can meet your needs, I suggest you check out AWS' I3 instances for
> short-term testing. They have up to 8 * 1.9 TB NVMe drives. At Cycle
> Computing we've helped a number of .com's and .edu's address high-end IO
> needs using these or similar instances. If you have a decent background
> with filesystems these cloud instances can be excellent performers, either
> for test/lab scenarios like this or production environments.
>
> Hope this helps!
>
>
> Best,
>
> Evan Burness
>
> -------------------------
> Evan Burness
> Director, HPC
> Cycle Computing
> evan.burness at cyclecomputing.com
> (919) 724-9338
>
>
>
>
>
> On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griznog at gmail.com> wrote:
>
>> There are no dumb questions in this snafu, I have already covered the
>> dumb aspects adequately :)
>>
>> Replication was not enabled, this was scratch space set up to be as large
>> and fast as possible. The fact that I can say "it was scratch" doesn't make
>> it sting less, thus the grasping at straws.
>> jbh
>>
>> On Sat, Apr 29, 2017, 7:05 PM Evan Burness <
>> evan.burness at cyclecomputing.com> wrote:
>>
>>> Hi John,
>>>
>>> I'm not a GPFS expert, but I did manage some staff that ran GPFS
>>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
>>> were doing.
>>>
>>> Perhaps a dumb question, but should we infer from your note that
>>> metadata replication is not enabled across those 4 NSDs handling it?
>>>
>>>
>>> Best,
>>>
>>> Evan
>>>
>>>
>>> -------------------------
>>> Evan Burness
>>> Director, HPC
>>> Cycle Computing
>>> evan.burness at cyclecomputing.com
>>> (919) 724-9338
>>>
>>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.john at gmail.com
>>> > wrote:
>>>
>>>> just a friendly reminder that while the probability of a particular
>>>> coincidence might be very low, the probability that there will be **some**
>>>> coincidence is very high.
>>>>
>>>> Peter (pedant)
>>>>
>>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm not getting much useful vendor information so I thought I'd ask
>>>>> here in the hopes that a GPFS expert can offer some advice. We have a GPFS
>>>>> system which has the following disk config:
>>>>>
>>>>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>>>>> disk         driver   sector     failure holds    holds
>>>>>              storage
>>>>> name         type       size       group metadata data  status
>>>>>  availability pool
>>>>> ------------ -------- ------ ----------- -------- ----- -------------
>>>>> ------------ ------------
>>>>> SAS_NSD_00   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_01   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_02   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_03   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_04   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_05   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_06   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_07   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_08   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_09   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_10   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_11   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_12   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_13   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_14   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_15   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_16   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_17   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_18   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_19   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_20   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SAS_NSD_21   nsd         512         100 No       Yes   ready
>>>>> up           system
>>>>> SSD_NSD_23   nsd         512         200 Yes      No    ready
>>>>> up           system
>>>>> SSD_NSD_24   nsd         512         200 Yes      No    ready
>>>>> up           system
>>>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied
>>>>> down         system
>>>>> SSD_NSD_26   nsd         512         200 Yes      No    ready
>>>>> up           system
>>>>>
>>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a
>>>>> series of unfortunate events and will not be coming back. From the GPFS
>>>>> troubleshooting guide it appears that my only alternative is to run
>>>>>
>>>>> mmdeldisk grsnas_data  SSD_NSD_25 -p
>>>>>
>>>>> around which the documentation also warns is irreversible, the sky is
>>>>> likely to fall, dogs and cats sleeping together, etc. But at this point I'm
>>>>> already in an irreversible situation. Of course this is a scratch
>>>>> filesystem, of course people were warned repeatedly about the risk of using
>>>>> a scratch filesystem that is not backed up and of course many ignored that.
>>>>> I'd like to recover as much as possible here. Can anyone confirm/reject
>>>>> that deleting this disk is the best way forward or if there are other
>>>>> alternatives to recovering data from GPFS in this situation?
>>>>>
>>>>> Any input is appreciated. Adding salt to the wound is that until a few
>>>>> months ago I had a complete copy of this filesystem that I had made onto
>>>>> some new storage as a burn-in test but then removed as that storage was
>>>>> consumed... As they say, sometimes you eat the bear, and sometimes, well,
>>>>> the bear eats you.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> jbh
>>>>>
>>>>> (Naively calculated probability of these two disks failing close
>>>>> together in this array: 0.00001758. I never get this lucky when buying
>>>>> lottery tickets.)
>>>>> --
>>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>>>> improve the world of today.’
>>>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>>
>>>>> _______________________________________________
>>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>>>> Computing
>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>>> Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>
>>>>
>>>
>>>
>>> --
>>> Evan Burness
>>> Director, HPC Solutions
>>> Cycle Computing
>>> evan.burness at cyclecomputing.com
>>> (919) 724-9338
>>>
>> --
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>
>
>
>
> --
> Evan Burness
> Director, HPC Solutions
> Cycle Computing
> evan.burness at cyclecomputing.com
> (919) 724-9338
>
-- 
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20170429/e5a633c0/attachment-0001.html>