[Beowulf] lustre / pytorch
Ellis Wilson
ellis at ellisv3.com
Mon Jul 15 15:14:16 UTC 2024
Looks like you cross-posted on the Lustre list, which is a great spot to
ask. The things I would usually do here are:
1. If I can manage to reproduce this with a single process from a
single client, then I strace it with numerous flags and see what syscall
or similar it's stuck on when it comes to a halt. Alternatively you can
attach to a seemingly hung process and you may see the last syscall it
issued and is waiting on (or issuing and timing out on), though that
hasn't always worked in my experience. If you can only repro this with
lots of clients and processes, attaching to a couple and waiting until
they time out should give you a decent idea of what they are timing out
on (example invocations are sketched after point 2).
2. On Lustre, if you have access to the MGS/MDS node you should be able
to register a changelog consumer and enable a sufficiently broad
changelog mask to capture all of the relevant operations on the
filesystem. Then trigger your problematic workload, and finally read the
changelogs out and look for what the hung client(s) were doing around
the time the hang occurred (commands sketched below). This is expensive,
and you'll need to make sure you deregister the changelogs after the
fact or you'll drive your MDS out of space in the long term.
Best,
ellis
On 7/15/24 11:01, Michael DiDomenico wrote:
> that's interesting on two counts, one being that file locks are in play.
> i've tried with both flock and noflock on the clients, but neither
> seemed to make a difference (i presumed file locks weren't taking
> place)
>
> is there something we should put in the code to ensure all the RANKs
> are established at the beginning, or maybe throughout the run? (perhaps
> something odd happens later on)
>
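
(For the code side of that question: below is a minimal sketch of the
kind of rank guard being discussed, assuming torch.distributed is
initialized from the usual env:// variables (RANK/WORLD_SIZE); the
function name and structure are just illustrative, not from your code.)

  import os
  import torch
  import torch.distributed as dist

  def save_checkpoint(state, path):
      # Figure out this worker's rank. If RANK isn't exported on every
      # node, every worker reports rank 0 and they all race to write
      # (and lock) the same checkpoint file.
      if dist.is_available() and dist.is_initialized():
          rank = dist.get_rank()
      else:
          rank = int(os.environ.get("RANK", "0"))

      if rank == 0:
          torch.save(state, path)

      # Keep the other ranks from running ahead while rank 0 writes.
      if dist.is_available() and dist.is_initialized():
          dist.barrier()
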
> On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcatana at gmail.com> wrote:
>>
>> I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write. Eventually it just times out.
>>
>>
>> On Fri, Jul 12, 2024, 1:47 PM plegresl at gmail.com <plegresl at gmail.com> wrote:
>>>
>>> I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?
>>>
>>>> Date: Thu, 11 Jul 2024 12:25:05 -0400
>>>> From: Michael DiDomenico <mdidomenico4 at gmail.com>
>>>> To: Beowulf Mailing List <Beowulf at beowulf.org>
>>>> Subject: [Beowulf] lustre / pytorch
>>>>
>>>> i have a strange problem, but honestly i'm not sure where the issue
>>>> is. we have users running LLM models through pytorch. part of the
>>>> process saves off checkpoints at periodic intervals. when the
>>>> checkpoint files are being written, we can see in the logs pytorch
>>>> writing out the save files from each of the processes to lustre.
>>>>
>>>> it chugs along for a little bit, but then comes to a grinding halt.
>>>> no error from pytorch is logged and no errors can be found on the
>>>> lustre clients or servers. the problem is also not transient; it
>>>> happens every time the process runs.
>>>>
>>>> the weird part is, if we switch the output directory from lustre to
>>>> nfs (netapp backed), the pytorch run works perfectly fine
>>>>
>>>> has anyone seen anything like this? any suggestions on
>>>> troubleshooting the issue?
>>>>
>>>> given that we have a 10x performance difference between netapp and
>>>> lustre, i'm pretty keen on getting this fixed
>>>