[Beowulf] GPFS and failed metadata NSD

Evan Burness evan.burness at cyclecomputing.com
Sat Apr 29 11:13:49 PDT 2017


;-)

On Sat, Apr 29, 2017 at 1:12 PM, John Hanks <griznog at gmail.com> wrote:

> Thanks for the suggestions, but when this Phoenix rises from the ashes it
> will be running BeeGFS over ZFS. The more I learn about GPFS the more I am
> reminded of a quote seen recently on Twitter:
>
> "People bred, selected, and compensated to find complicated solutions do
> not have an incentive to implement simplified ones." -- @nntaleb
> <https://twitter.com/nntaleb>
>
> You can only read "you should contact support" so many times in
> documentation and forum posts before you remember "oh yeah, IBM is a
> _services_ company."
>
> jbh
>
>
> On Sat, Apr 29, 2017 at 8:58 PM Evan Burness <evan.burness at cyclecomputing.com> wrote:
>
>> Hi John,
>>
>> Yeah, I think the best word here is "ouch" unfortunately. I asked a few
>> of my GPFS-savvy colleagues and they all agreed there aren't many good
>> options here.
>>
>> The one "suggestion" (I promise, no Monday morning quarterbacking) I and
>> my storage admins friends can offer, if you have the ability to do so (both
>> from a technical but also from a procurement/policy change standpoint) is
>> to swap out spinning drives for NVMe ones for your metadata servers. Yes,
>> you'll still take the write performance hit from replication relative to a
>> non-replicated state, but modern NAND and NVMe drives are so fast and low
>> latency that it will still be as fast or faster than the replicated,
>> spinning disk approach it sounds like (please forgive me if I'm
>> misunderstanding this piece).
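>>
>> Purely as a sketch (the device name, NSD name, server name, and failure
>> group below are invented, not taken from your setup), adding NVMe
>> metadata-only NSDs would look roughly like this:
>>
>> # hypothetical stanza file, nvme_meta.stanza -- adjust to your environment
>> %nsd:
>>   device=/dev/nvme0n1
>>   nsd=META_NVME_00
>>   servers=nsdserver01
>>   usage=metadataOnly
>>   failureGroup=210
>>   pool=system
>>
>> # create the NSD, add it to the filesystem, then rebalance
>> mmcrnsd -F nvme_meta.stanza
>> mmadddisk grsnas_data -F nvme_meta.stanza
>> mmrestripefs grsnas_data -b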
>>
>> We took this very approach on a 10+ petabyte DDN SFA14k running GPFS
>> 4.2.1 that was designed to house research and clinical data for a large US
>> hospital. They had 600+ million files between 0 and 10 MB, so we had
>> high-end requirements for both metadata performance AND reliability. Like
>> you, we tagged 4 GPFS NSDs with metadata duty and gave each a 1.6 TB Intel
>> P3608 NVMe disk, and the performance was still exceptionally good even with
>> replication because these modern drives are such fire-breathing IOPS
>> monsters. If you don't have as much data as in this scenario, you could
>> definitely get away with the 400 or 800 GB versions and save yourself a
>> fair amount of $$.
>>
>> Also, if you're looking to experiment with whether a replicated approach
>> can meet your needs, I suggest you check out AWS' I3 instances for
>> short-term testing. They have up to 8 x 1.9 TB NVMe drives. At Cycle
>> Computing we've helped a number of .com's and .edu's address high-end IO
>> needs using these or similar instances. If you have a decent background
>> with filesystems, these cloud instances can be excellent performers, either
>> for test/lab scenarios like this or for production environments.
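>>
>> As a very rough illustration (the AMI ID, key pair, and security group
>> below are placeholders, not anything we actually run), standing up one I3
>> node for a short-lived filesystem experiment is about this much AWS CLI:
>>
>> # placeholders throughout -- substitute your own AMI, key pair, and
>> # security group before running
>> aws ec2 run-instances \
>>     --image-id ami-xxxxxxxx \
>>     --instance-type i3.16xlarge \
>>     --key-name my-keypair \
>>     --security-group-ids sg-xxxxxxxx \
>>     --count 1
>>
>> # once the instance is up, the eight local NVMe drives appear as block
>> # devices and can be checked with:
>> lsblk | grep nvme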
>>
>> Hope this helps!
>>
>>
>> Best,
>>
>> Evan Burness
>>
>> -------------------------
>> Evan Burness
>> Director, HPC
>> Cycle Computing
>> evan.burness at cyclecomputing.com
>> (919) 724-9338
>>
>>
>>
>>
>>
>> On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griznog at gmail.com> wrote:
>>
>>> There are no dumb questions in this snafu; I have already covered the
>>> dumb aspects adequately :)
>>>
>>> Replication was not enabled; this was scratch space set up to be as
>>> large and fast as possible. The fact that I can say "it was scratch"
>>> doesn't make it sting any less, hence the grasping at straws.
>>> jbh
>>>
>>> On Sat, Apr 29, 2017, 7:05 PM Evan Burness <evan.burness at cyclecomputing.com> wrote:
>>>
>>>> Hi John,
>>>>
>>>> I'm not a GPFS expert, but I did manage some staff that ran GPFS
>>>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
>>>> were doing.
>>>>
>>>> Perhaps a dumb question, but should we infer from your note that
>>>> metadata replication is not enabled across those 4 NSDs handling it?
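>>>>
>>>> (If it's useful for double-checking, something like the following should
>>>> show it -- just a sketch, using the filesystem name from your listing
>>>> below; replace the file path with a real one:)
>>>>
>>>> # default (-m, -r) and maximum (-M, -R) metadata/data replication factors
>>>> mmlsfs grsnas_data -m -M -r -R
>>>>
>>>> # replication actually applied to an individual file
>>>> mmlsattr -L /path/to/some/file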
>>>>
>>>>
>>>> Best,
>>>>
>>>> Evan
>>>>
>>>>
>>>> -------------------------
>>>> Evan Burness
>>>> Director, HPC
>>>> Cycle Computing
>>>> evan.burness at cyclecomputing.com
>>>> (919) 724-9338
>>>>
>>>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.john at gmail.com> wrote:
>>>>
>>>>> Just a friendly reminder that while the probability of a particular
>>>>> coincidence might be very low, the probability that there will be **some**
>>>>> coincidence is very high.
>>>>>
>>>>> Peter (pedant)
>>>>>
>>>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm not getting much useful vendor information so I thought I'd ask
>>>>>> here in the hopes that a GPFS expert can offer some advice. We have a GPFS
>>>>>> system which has the following disk config:
>>>>>>
>>>>>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>>>>>> disk         driver   sector     failure holds    holds                            storage
>>>>>> name         type       size       group metadata data  status        availability pool
>>>>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>>>>>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up           system
>>>>>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up           system
>>>>>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up           system
>>>>>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
>>>>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
>>>>>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up           system
>>>>>>
>>>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a
>>>>>> series of unfortunate events and will not be coming back. From the GPFS
>>>>>> troubleshooting guide it appears that my only alternative is to run
>>>>>>
>>>>>> mmdeldisk grsnas_data  SSD_NSD_25 -p
>>>>>>
>>>>>> which the documentation warns is irreversible, the sky is likely to fall,
>>>>>> dogs and cats sleeping together, etc. But at this point I'm already in an
>>>>>> irreversible situation. Of course this is a scratch filesystem, of course
>>>>>> people were warned repeatedly about the risk of using a scratch filesystem
>>>>>> that is not backed up, and of course many ignored that. I'd like to recover
>>>>>> as much as possible here. Can anyone confirm or reject that deleting this
>>>>>> disk is the best way forward, or suggest other ways to recover data from
>>>>>> GPFS in this situation?
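>>>>>>
>>>>>> For what it's worth, the rough sequence I'm contemplating (pieced together
>>>>>> from the troubleshooting guide and untested here, so treat it as a sketch
>>>>>> rather than a known-good procedure) is:
>>>>>>
>>>>>> # long shot: see if GPFS will bring the dead NSD back up at all
>>>>>> mmchdisk grsnas_data start -d SSD_NSD_25
>>>>>>
>>>>>> # failing that, drop the disk and accept that its metadata is gone
>>>>>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>>>>>
>>>>>> # then check/repair what remains of the filesystem structure (unmounted)
>>>>>> mmfsck grsnas_data -y
>>>>>>
>>>>>> # and restripe/re-replicate across the surviving NSDs
>>>>>> mmrestripefs grsnas_data -r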
>>>>>>
>>>>>> Any input is appreciated. Adding salt to the wound is that until a
>>>>>> few months ago I had a complete copy of this filesystem that I had made
>>>>>> onto some new storage as a burn-in test but then removed as that storage
>>>>>> was consumed... As they say, sometimes you eat the bear, and sometimes,
>>>>>> well, the bear eats you.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> jbh
>>>>>>
>>>>>> (Naively calculated probability of these two disks failing close
>>>>>> together in this array: 0.00001758. I never get this lucky when buying
>>>>>> lottery tickets.)
>>>>>> --
>>>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>>>>> improve the world of today.’
>>>>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Evan Burness
>>>> Director, HPC Solutions
>>>> Cycle Computing
>>>> evan.burness at cyclecomputing.com
>>>> (919) 724-9338
>>>>
>>> --
>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>> improve the world of today.’
>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>
>>
>>
>>
>> --
>> Evan Burness
>> Director, HPC Solutions
>> Cycle Computing
>> evan.burness at cyclecomputing.com
>> (919) 724-9338
>>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>



-- 
Evan Burness
Director, HPC Solutions
Cycle Computing
evan.burness at cyclecomputing.com
(919) 724-9338