[Beowulf] GPFS and failed metadata NSD
Evan Burness
evan.burness at cyclecomputing.com
Sat Apr 29 10:57:41 PDT 2017
Hi John,
Yeah, I think the best word here is "ouch" unfortunately. I asked a few of
my GPFS-savvy colleagues and they all agreed there aren't many good options
here.
The one "suggestion" (I promise, no Monday morning quarterbacking) my
storage admin friends and I can offer, if you have the ability to do so
(from both a technical and a procurement/policy change standpoint), is to
swap out spinning drives for NVMe ones on your metadata servers. Yes,
you'll still take the write performance hit from replication relative to a
non-replicated state, but modern NAND and NVMe drives are so fast and low
latency that it sounds like it will still be as fast as or faster than the
replicated, spinning-disk approach (please forgive me if I'm
misunderstanding this piece).
We took this very approach on a 10+ petabyte DDN SFA14k running GPFS 4.2.1
that was designed to house research and clinical data for a large US
hospital. They had 600+ million files between 0 and 10 MB, so we had high-end
requirements for both metadata performance AND reliability. Like you, we
tagged 4 GPFS NSDs with metadata duty and gave each a 1.6 TB Intel P3608
NVMe disk, and the performance was still exceptionally good even with
replication because these modern drives are such fire-breathing IOPS
monsters. If you don't have as much data as this scenario, you could
definitely get away with 400 or 800 GB versions and save yourself a fair
amount of $$.
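
For what it's worth, here is a rough way to sanity-check that kind of
sizing. It is only a sketch: the drive count, drive size, and file count
are the figures quoted above, while the two-copy replication factor, the
4 KiB inode size, and the function name are assumptions for illustration,
so substitute whatever your own filesystem actually uses.

```python
# Back-of-the-envelope metadata capacity check.
# Drive count, drive size, and file count come from the scenario above;
# the replication factor and inode size are ASSUMPTIONS, not measured values.

TB = 10**12  # decimal terabyte, as drive vendors count


def metadata_budget_per_file(num_drives, drive_bytes, replicas, num_files):
    """Return usable metadata bytes available per file after replication."""
    raw = num_drives * drive_bytes
    usable = raw / replicas  # every metadata block is stored `replicas` times
    return usable / num_files


budget = metadata_budget_per_file(
    num_drives=4,
    drive_bytes=1_600_000_000_000,  # 1.6 TB per NVMe drive
    replicas=2,                     # assumed: two copies of all metadata
    num_files=600_000_000,
)

ASSUMED_INODE_BYTES = 4096  # assumption; check your filesystem's inode size

print(f"{budget:.0f} bytes of metadata space per file")
print(f"ratio to one assumed inode: {budget / ASSUMED_INODE_BYTES:.2f}x")
```

Rerunning it with 400 GB or 800 GB drives and your own file count gives a
quick feel for whether the smaller drives leave enough room. Directories,
indirect blocks, and other metadata structures also consume space, so
treat the result as an upper bound rather than a guarantee.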
Also, if you're looking to experiment with whether a replicated approach
can meet your needs, I suggest you check out AWS' I3 instances for
short-term testing. They have up to 8 * 1.9 TB NVMe drives. At Cycle
Computing we've helped a number of .com's and .edu's address high-end IO
needs using these or similar instances. If you have a decent background
with filesystems these cloud instances can be excellent performers, either
for test/lab scenarios like this or production environments.
Hope this helps!
Best,
Evan Burness
-------------------------
Evan Burness
Director, HPC
Cycle Computing
evan.burness at cyclecomputing.com
(919) 724-9338
On Sat, Apr 29, 2017 at 11:13 AM, John Hanks <griznog at gmail.com> wrote:
> There are no dumb questions in this snafu, I have already covered the dumb
> aspects adequately :)
>
> Replication was not enabled, this was scratch space set up to be as large
> and fast as possible. The fact that I can say "it was scratch" doesn't make
> it sting less, thus the grasping at straws.
> jbh
>
> On Sat, Apr 29, 2017, 7:05 PM Evan Burness <evan.burness at cyclecomputing.com>
> wrote:
>
>> Hi John,
>>
>> I'm not a GPFS expert, but I did manage some staff that ran GPFS
>> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
>> were doing.
>>
>> Perhaps a dumb question, but should we infer from your note that metadata
>> replication is not enabled across those 4 NSDs handling it?
>>
>>
>> Best,
>>
>> Evan
>>
>>
>> -------------------------
>> Evan Burness
>> Director, HPC
>> Cycle Computing
>> evan.burness at cyclecomputing.com
>> (919) 724-9338
>>
>> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.john at gmail.com>
>> wrote:
>>
>>> just a friendly reminder that while the probability of a particular
>>> coincidence might be very low, the probability that there will be **some**
>>> coincidence is very high.
>>>
>>> Peter (pedant)
>>>
>>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm not getting much useful vendor information so I thought I'd ask
>>>> here in the hopes that a GPFS expert can offer some advice. We have a GPFS
>>>> system which has the following disk config:
>>>>
>>>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>>>> disk         driver   sector failure     holds    holds                             storage
>>>> name         type     size   group       metadata data  status        availability pool
>>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>>>> SAS_NSD_00   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_01   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_02   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_03   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_04   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_05   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_06   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_07   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_08   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_09   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_10   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_11   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_12   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_13   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_14   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_15   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_16   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_17   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_18   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_19   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_20   nsd      512    100         No       Yes   ready         up           system
>>>> SAS_NSD_21   nsd      512    100         No       Yes   ready         up           system
>>>> SSD_NSD_23   nsd      512    200         Yes      No    ready         up           system
>>>> SSD_NSD_24   nsd      512    200         Yes      No    ready         up           system
>>>> SSD_NSD_25   nsd      512    200         Yes      No    to be emptied down         system
>>>> SSD_NSD_26   nsd      512    200         Yes      No    ready         up           system
>>>>
>>>> SSD_NSD_25 is a mirror in which both drives have failed due to a series
>>>> of unfortunate events and will not be coming back. From the GPFS
>>>> troubleshooting guide it appears that my only alternative is to run
>>>>
>>>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>>>
>>>> which the documentation also warns is irreversible, the sky is
>>>> likely to fall, dogs and cats sleeping together, etc. But at this point I'm
>>>> already in an irreversible situation. Of course this is a scratch
>>>> filesystem, of course people were warned repeatedly about the risk of using
>>>> a scratch filesystem that is not backed up and of course many ignored that.
>>>> I'd like to recover as much as possible here. Can anyone confirm/reject
>>>> that deleting this disk is the best way forward or if there are other
>>>> alternatives to recovering data from GPFS in this situation?
>>>>
>>>> Any input is appreciated. Adding salt to the wound is that until a few
>>>> months ago I had a complete copy of this filesystem that I had made onto
>>>> some new storage as a burn-in test but then removed as that storage was
>>>> consumed... As they say, sometimes you eat the bear, and sometimes, well,
>>>> the bear eats you.
>>>>
>>>> Thanks,
>>>>
>>>> jbh
>>>>
>>>> (Naively calculated probability of these two disks failing close
>>>> together in this array: 0.00001758. I never get this lucky when buying
>>>> lottery tickets.)
>>>> --
>>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>>> improve the world of today.’
>>>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>>
>>>> _______________________________________________
>>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>>> Computing
>>>> To change your subscription (digest mode or unsubscribe) visit
>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>
>>>>
>>>
>>
>>
>> --
>> Evan Burness
>> Director, HPC Solutions
>> Cycle Computing
>> evan.burness at cyclecomputing.com
>> (919) 724-9338
>>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>