[Beowulf] GPFS and failed metadata NSD
Evan Burness
evan.burness at cyclecomputing.com
Sat Apr 29 09:04:40 PDT 2017
Hi John,
I'm not a GPFS expert, but I did manage some staff who ran GPFS
filesystems while I was at NCSA. Those folks reeeaaalllly knew what they
were doing.
Perhaps a dumb question, but should we infer from your note that metadata
replication is not enabled across those 4 NSDs handling it?
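If it helps, I believe the replication settings can be checked with
something along these lines (going from memory, so treat this as a sketch
rather than a definitive recipe):

mmlsfs grsnas_data -m -M    # default and maximum metadata replicas
mmlsfs grsnas_data -r -R    # default and maximum data replicas

If -m comes back as 1, the metadata that lived on the failed NSDs had no
second copy anywhere in the filesystem.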
Best,
Evan
-------------------------
Evan Burness
Director, HPC
Cycle Computing
evan.burness at cyclecomputing.com
(919) 724-9338
On Sat, Apr 29, 2017 at 9:36 AM, Peter St. John <peter.st.john at gmail.com>
wrote:
> just a friendly reminder that while the probability of a particular
> coincidence might be very low, the probability that there will be **some**
> coincidence is very high.
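> To put a rough number on it: if a particular coincidence has a small
> probability p and there are N independent opportunities for it to occur,
> the chance of seeing at least one is 1 - (1 - p)^N, which is roughly N*p
> for small p, and N grows quickly once you count all the disk pairs, time
> windows, and filesystems in play.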
>
> Peter (pedant)
>
> On Sat, Apr 29, 2017 at 3:00 AM, John Hanks <griznog at gmail.com> wrote:
>
>> Hi,
>>
>> I'm not getting much useful vendor information so I thought I'd ask here
>> in the hopes that a GPFS expert can offer some advice. We have a GPFS
>> system which has the following disk config:
>>
>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>> disk         driver   sector failure holds    holds                              storage
>> name         type     size   group   metadata data  status        availability pool
>> ------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
>> SAS_NSD_00   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_01   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_02   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_03   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_04   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_05   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_06   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_07   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_08   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_09   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_10   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_11   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_12   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_13   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_14   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_15   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_16   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_17   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_18   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_19   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_20   nsd      512    100     No       Yes   ready         up           system
>> SAS_NSD_21   nsd      512    100     No       Yes   ready         up           system
>> SSD_NSD_23   nsd      512    200     Yes      No    ready         up           system
>> SSD_NSD_24   nsd      512    200     Yes      No    ready         up           system
>> SSD_NSD_25   nsd      512    200     Yes      No    to be emptied down         system
>> SSD_NSD_26   nsd      512    200     Yes      No    ready         up           system
>>
>> SSD_NSD_25 is a mirror in which both drives have failed due to a series
>> of unfortunate events and will not be coming back. From the GPFS
>> troubleshooting guide it appears that my only alternative is to run
>>
>> mmdeldisk grsnas_data SSD_NSD_25 -p
>>
>> which the documentation warns is irreversible, that the sky is
>> likely to fall, dogs and cats sleeping together, etc. But at this point I'm
>> already in an irreversible situation. Of course this is a scratch
>> filesystem, of course people were warned repeatedly about the risk of using
>> a scratch filesystem that is not backed up and of course many ignored that.
>> I'd like to recover as much as possible here. Can anyone confirm/reject
>> that deleting this disk is the best way forward or if there are other
>> alternatives to recovering data from GPFS in this situation?
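>>
>> For reference, my reading of the troubleshooting guide is that the
>> sequence would be roughly the following (corrections very welcome, I have
>> not run this yet):
>>
>> # delete the dead NSD, abandoning whatever lived only on it
>> mmdeldisk grsnas_data SSD_NSD_25 -p
>> # check and repair the remaining metadata (with the filesystem unmounted)
>> mmfsck grsnas_data -y
>> # re-replicate/rebalance across the surviving disks
>> mmrestripefs grsnas_data -r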
>>
>> Any input is appreciated. Rubbing salt into the wound is that until a few
>> months ago I had a complete copy of this filesystem that I had made onto
>> some new storage as a burn-in test but then removed as that storage was
>> consumed... As they say, sometimes you eat the bear, and sometimes, well,
>> the bear eats you.
>>
>> Thanks,
>>
>> jbh
>>
>> (Naively calculated probability of these two disks failing close together
>> in this array: 0.00001758. I never get this lucky when buying lottery
>> tickets.)
>> --
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>
--
Evan Burness
Director, HPC Solutions
Cycle Computing
evan.burness at cyclecomputing.com
(919) 724-9338