[Beowulf] GPFS and failed metadata NSD

Sat Apr 29 09:13:10 PDT 2017

There are no dumb questions in this snafu; I have already covered the dumb
aspects adequately :)

Replication was not enabled; this was scratch space set up to be as large
and fast as possible. The fact that I can say "it was scratch" doesn't make
it sting less, thus the grasping at straws.
jbh

On Sat, Apr 29, 2017, 7:05 PM Evan Burness <evan.burness at cyclecomputing.com>
wrote:

> Hi John,
>
> I'm not a GPFS expert, but I did manage some staff that ran GPFS
> filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
> were doing.
>
> Perhaps a dumb question, but should we infer from your note that metadata
> replication is not enabled across those 4 NSDs handling it?
>
>
> Best,
>
> Evan
>
>
> -------------------------
> Evan Burness
> Director, HPC
> Cycle Computing
> evan.burness at cyclecomputing.com
> (919) 724-9338
>
> On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.john at gmail.com>
> wrote:
>
>> just a friendly reminder that while the probability of a particular
>> coincidence might be very low, the probability that there will be **some**
>> coincidence is very high.
>>
>> Peter (pedant)
>>
>> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm not getting much useful vendor information so I thought I'd ask here
>>> in the hopes that a GPFS expert can offer some advice. We have a GPFS
>>> system which has the following disk config:
>>>
>>> [root@grsnas01 ~]# mmlsdisk grsnas_data
>>> disk         driver   sector     failure holds    holds                            storage
>>> name         type       size       group metadata data  status        availability pool
>>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up           system
>>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up           system
>>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up           system
>>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
>>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
>>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up           system
>>>
>>> SSD_NSD_25 is a mirror in which both drives have failed due to a series
>>> of unfortunate events and will not be coming back. From the GPFS
>>> troubleshooting guide it appears that my only alternative is to run
>>>
>>> mmdeldisk grsnas_data  SSD_NSD_25 -p
>>>
>>> which the documentation also warns is irreversible, the sky is
>>> likely to fall, dogs and cats sleeping together, etc. But at this point I'm
>>> already in an irreversible situation. Of course this is a scratch
>>> filesystem, of course people were warned repeatedly about the risk of using
>>> a scratch filesystem that is not backed up and of course many ignored that.
>>> I'd like to recover as much as possible here. Can anyone confirm or reject
>>> that deleting this disk is the best way forward, or suggest other
>>> ways of recovering data from GPFS in this situation?
>>>
>>> Any input is appreciated. Adding salt to the wound is that until a few
>>> months ago I had a complete copy of this filesystem that I had made onto
>>> some new storage as a burn-in test but then removed as that storage was
>>> consumed... As they say, sometimes you eat the bear, and sometimes, well,
>>> the bear eats you.
>>>
>>> Thanks,
>>>
>>> jbh
>>>
>>> (Naively calculated probability of these two disks failing close
>>> together in this array: 0.00001758. I never get this lucky when buying
>>> lottery tickets.)
>>> --
>>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>>> improve the world of today.’
>>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>>
>>
>
>
> --
> Evan Burness
> Director, HPC Solutions
> Cycle Computing
> evan.burness at cyclecomputing.com
> (919) 724-9338
>
-- 
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
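
The point made earlier in the thread, that a particular coincidence is
unlikely while some coincidence is likely, can be illustrated with a
minimal sketch. It assumes drive failures are independent and that each
drive fails within the window of interest with the same probability.
The function names, the per-drive probability and the mirror counts are
hypothetical, and the sketch does not try to reproduce the 0.00001758
figure quoted above, since the inputs to that calculation are not given.

```python
def p_specific_mirror_lost(p_drive: float) -> float:
    """Chance that both drives of ONE chosen two-way mirror fail.

    Assumes the two failures are independent (a naive model).
    """
    return p_drive * p_drive


def p_any_mirror_lost(p_drive: float, n_mirrors: int) -> float:
    """Chance that AT LEAST ONE of n_mirrors two-way mirrors loses both drives.

    Computed as the complement of "no mirror loses both drives".
    """
    return 1.0 - (1.0 - p_specific_mirror_lost(p_drive)) ** n_mirrors


if __name__ == "__main__":
    p_drive = 0.004  # hypothetical per-drive failure probability in the window
    for n_mirrors in (1, 4, 100, 1000):  # hypothetical mirror counts
        print(n_mirrors, p_any_mirror_lost(p_drive, n_mirrors))
```

With one mirror the two functions agree; as the number of mirrors grows,
the chance that some mirror is lost grows with it, even though the chance
for any named mirror stays the same.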