[Beowulf] GPFS and failed metadata NSD

Alex Chekholko alex.chekholko at gmail.com
Sat Apr 29 12:31:29 PDT 2017


Looking at that disk config, your metadata was striped across 4 devices, and
you lost 1/4 of it.  There is not much you can do to come back from that.

I had a similar but easier situation in the past where I lost some data
disks (but not the metadata!).  Using some low-level tools, you can scan the
GPFS metadata and build a list of the files that had data blocks on the lost
data disks, and then restore or delete just those files (a sketch of that is
below).
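
For reference, a minimal sketch of that scan using mmfileid, which lists the
files that have blocks on a given disk.  Check the exact syntax against the
man page for your GPFS release; the filesystem and NSD names below are just
stand-ins borrowed from John's setup:

  # List every file with at least one block on the named NSD,
  # writing the result to a file for a later targeted restore/delete
  mmfileid grsnas_data -d SAS_NSD_05 -o /tmp/files_on_failed_disk.txt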

But in your case, you are out of luck.  I'm not sure what the behavior
would be after you mmdeldisk that disk, but I imagine you will not be able
to mount the fs after that, and your only option will be mmdelfs.
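
If it does come to that, a sketch of the last-resort sequence (both commands
are destructive and irreversible, so only after salvaging whatever you can):

  # Drop the dead metadata NSD; -p marks it as permanently damaged
  mmdeldisk grsnas_data SSD_NSD_25 -p
  # If the filesystem cannot be mounted afterwards, the only remaining step
  mmdelfs grsnas_data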

Re "BeeGFS over ZFS" vs "GPFS" I think you fill find the corner-case
failure modes are not that much simpler in either case.  "Better the devil
you know..."


On Sat, Apr 29, 2017 at 11:14 AM Evan Burness <
evan.burness at cyclecomputing.com> wrote:

> ;-)
>
> On Sat, Apr 29, 2017 at 1:12 PM, John Hanks <griznog at gmail.com> wrote:
>
>> Thanks for the suggestions, but when this Phoenix rises from the ashes it
>> will be running BeeGFS over ZFS. The more I learn about GPFS, the more I am
>> reminded of a quote I saw recently on Twitter:
>>
>> "People bred, selected, and compensated to find complicated solutions do
>> not have an incentive to implement simplified ones." -- @nntaleb
>> <https://twitter.com/nntaleb>
>>
>> You can only read "you should contact support" so many times in
>> documentation and forum posts before you remember "oh yeah, IBM is a
>> _services_ company."
>>
>> jbh
>>
>>
>> On Sat, Apr 29, 2017 at 8:58 PM Evan Burness <
>> evan.burness at cyclecomputing.com> wrote:
>>
>>> Hi John,
>>>
>>> Yeah, I think the best word here is "ouch" unfortunately. I asked a few
>>> of my GPFS-savvy colleagues and they all agreed there aren't many good
>>> options here.
>>>
>>> The one "suggestion" (I promise, no Monday morning quarterbacking) I and
>>> my storage admins friends can offer, if you have the ability to do so (both
>>> from a technical but also from a procurement/policy change standpoint) is
>>> to swap out spinning drives for NVMe ones for your metadata servers. Yes,
>>> you'll still take the write performance hit from replication relative to a
>>> non-replicated state, but modern NAND and NVMe drives are so fast and low
>>> latency that it will still be as fast or faster than the replicated,
>>> spinning disk approach it sounds like (please forgive me if I'm
>>> misunderstanding this piece).
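>>>
>>> If it helps, making a device metadata-only is just the usage field in the
>>> NSD stanza handed to mmcrnsd. A minimal sketch, with hypothetical device
>>> and server names:
>>>
>>>   # stanza file for mmcrnsd: one NVMe device, metadata only
>>>   %nsd: nsd=SSD_META_01
>>>     device=/dev/nvme0n1
>>>     servers=nsd01,nsd02
>>>     usage=metadataOnly
>>>     failureGroup=201
>>>     pool=system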
>>>
>>> We took this very approach on a 10+ petabyte DDN SFA14k running GPFS
>>> 4.2.1 that was designed to house research and clinical data for a large US
>>> hospital. They had 600+ million files between 0 and 10 MB, so we had
>>> high-end requirements for both metadata performance AND reliability. Like
>>> you, we tagged 4 GPFS NSDs with metadata duty and gave each a 1.6 TB Intel
>>> P3608 NVMe drive, and the performance was still exceptionally good even
>>> with replication, because these modern drives are such fire-breathing IOPS
>>> monsters. If you don't have as much data as in this scenario, you could
>>> definitely get away with the 400 or 800 GB versions and save yourself a
>>> fair amount of $$.
>>>
>>> Also, if you're looking to experiment with whether a replicated approach
>>> can meet your needs, I suggest you check out AWS' I3 instances for
>>> short-term testing. They have up to 8 * 1.9 TB NVMe drives. At Cycle
>>> Computing we've helped a number of .com's and .edu's address high-end IO
>>> needs using these or similar instances. If you have a decent background
>>> with filesystems, these cloud instances can be excellent performers, either
>>> for test/lab scenarios like this or for production environments.
>>>
>>> Hope this helps!
>>>
>>>
>>> Best,
>>>
>>> Evan Burness
>>>
>>> -------------------------
>>> Evan Burness
>>> Director, HPC
>>> Cycle Computing
>>> evan.burness at cyclecomputing.com
>>> (919) 724-9338
>>>
>>>
>>>
>>>
>>>
>>> On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griznog at gmail.com> wrote:
>>>
>>>> There are no dumb questions in this snafu; I have already covered the
>>>> dumb aspects adequately :)
>>>>
>>>> Replication was not enabled; this was scratch space set up to be as
>>>> large and fast as possible. The fact that I can say "it was scratch"
>>>> doesn't make it sting less, hence the grasping at straws.
>>>>
>>>> jbh
>>>>
>>>> On Sat, Apr 29, 2017, 7:05 PM Evan Burness <
>>>> evan.burness at cyclecomputing.com> wrote:
>>>>
>>>>> Hi John,
>>>>>
>>>>> I'm not a GPFS expert, but I did manage some staff who ran GPFS
>>>>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
>>>>> were doing.
>>>>>
>>>>> Perhaps a dumb question, but should we infer from your note that
>>>>> metadata replication is not enabled across those 4 NSDs handling it?
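>>>>>
>>>>> The quick way to check, if I'm remembering the flags right, is mmlsfs:
>>>>>
>>>>>   # default (-m) and maximum (-M) number of metadata replicas
>>>>>   mmlsfs grsnas_data -m -M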
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>> Evan
>>>>>
>>>>>
>>>>> -------------------------
>>>>> Evan Burness
>>>>> Director, HPC
>>>>> Cycle Computing
>>>>> evan.burness at cyclecomputing.com
>>>>> (919) 724-9338
>>>>>
>>>>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <
>>>>> peter.st.john at gmail.com> wrote:
>>>>>
>>>>>> Just a friendly reminder that while the probability of a particular
>>>>>> coincidence might be very low, the probability that there will be *some*
>>>>>> coincidence is very high.
>>>>>>
>>>>>> Peter (pedant)
>>>>>>
>>>>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm not getting much useful vendor information, so I thought I'd ask
>>>>>>> here in the hopes that a GPFS expert can offer some advice. We have a
>>>>>>> GPFS system with the following disk config:
>>>>>>>
>>>>>>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>>>>>>> disk         driver   sector     failure holds    holds                            storage
>>>>>>> name         type       size       group metadata data  status        availability pool
>>>>>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>>>>>>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up           system
>>>>>>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up           system
>>>>>>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
>>>>>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
>>>>>>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up           system
>>>>>>>
>>>>>>> SSD_NSD_25 is a mirror in which both drives have failed, due to a
>>>>>>> series of unfortunate events, and it will not be coming back. From the
>>>>>>> GPFS troubleshooting guide it appears that my only alternative is to run
>>>>>>>
>>>>>>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>>>>>>
>>>>>>> which the documentation warns is irreversible, the sky is likely to
>>>>>>> fall, dogs and cats sleeping together, etc. But at this point I'm
>>>>>>> already in an irreversible situation. Of course this is a scratch
>>>>>>> filesystem; of course people were warned repeatedly about the risk of
>>>>>>> using a scratch filesystem that is not backed up; and of course many
>>>>>>> ignored that. I'd like to recover as much as possible here. Can anyone
>>>>>>> confirm or deny that deleting this disk is the best way forward, or
>>>>>>> suggest other ways of recovering data from GPFS in this situation?
>>>>>>>
>>>>>>> Any input is appreciated. Adding salt to the wound is that until a
>>>>>>> few months ago I had a complete copy of this filesystem that I had made
>>>>>>> onto some new storage as a burn-in test but then removed as that storage
>>>>>>> was consumed... As they say, sometimes you eat the bear, and sometimes,
>>>>>>> well, the bear eats you.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> jbh
>>>>>>>
>>>>>>> (Naively calculated probability of these two disks failing close
>>>>>>> together in this array: 0.00001758. I never get this lucky when buying
>>>>>>> lottery tickets.)
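>>>>>>>
>>>>>>> (That figure is just the per-drive failure probability squared:
>>>>>>> assuming each drive has roughly a 0.42% chance of dying inside the
>>>>>>> window, 0.0042 * 0.0042 ≈ 0.0000176 for one specific mirrored pair.)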
>>>>>>> --
>>>>>>> ‘[A] talent for following the ways of yesterday, is not sufficient
>>>>>>> to improve the world of today.’
>>>>>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Evan Burness
>>>>> Director, HPC Solutions
>>>>> Cycle Computing
>>>>> evan.burness at cyclecomputing.com
>>>>> (919) 724-9338
>>>>>
>>>> --
>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>>> improve the world of today.’
>>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>
>>>
>>>
>>>
>>> --
>>> Evan Burness
>>> Director, HPC Solutions
>>> Cycle Computing
>>> evan.burness at cyclecomputing.com
>>> (919) 724-9338
>>>
>> --
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>
>
>
>
> --
> Evan Burness
> Director, HPC Solutions
> Cycle Computing
> evan.burness at cyclecomputing.com
> (919) 724-9338
>