[Beowulf] lustre / pytorch

Michael DiDomenico mdidomenico4 at gmail.com
Mon Jul 15 15:01:14 UTC 2024


That's interesting on two counts, one being that file locks are in play
at all.  I've tried both flock and noflock on the clients, but neither
seemed to make a difference (I had presumed file locks weren't being
taken).
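
For what it's worth, this is the sort of quick probe I've been thinking
of running from a couple of client nodes against the same file on the
Lustre mount, just to see what the lock behaviour actually is (the
/lustre/scratch path below is only a placeholder, not our real layout).
My understanding is that on a noflock mount the flock() call should fail
outright rather than hang, while on a flock mount the holders should
serialize across nodes:

#!/usr/bin/env python3
# lock_probe.py -- run once per client node against the same file on the
# Lustre mount; path and filename are placeholders, not from our setup
import fcntl
import os
import socket
import sys
import time

lockfile = sys.argv[1] if len(sys.argv) > 1 else "/lustre/scratch/lock_probe"

fd = os.open(lockfile, os.O_CREAT | os.O_RDWR, 0o644)
host = socket.gethostname()
try:
    print(f"{host}: requesting exclusive flock", flush=True)
    fcntl.flock(fd, fcntl.LOCK_EX)      # may raise OSError on a noflock mount
    print(f"{host}: got the lock, holding for 10s", flush=True)
    time.sleep(10)
    fcntl.flock(fd, fcntl.LOCK_UN)
    print(f"{host}: released", flush=True)
except OSError as e:
    print(f"{host}: flock failed: {e}", flush=True)
finally:
    os.close(fd)

If the second node blocks for the full ten seconds, coherent locking is
working; if it gets the lock immediately, the nodes aren't seeing each
other's locks; and if it errors out, an application taking locks would
presumably hit the same thing.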

Is there something we should put in the code to ensure all the RANKs
are established at the beginning, or maybe checked throughout the run
(perhaps something odd happens later on)?  A rough sketch of what I
have in mind is below.
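
Just a sketch; the env var names and backends are the usual
torch.distributed ones, nothing specific to our job:

# rank_check.py -- startup/periodic sanity check for the ranks
import os
import socket

import torch.distributed as dist


def check_ranks():
    # these are the variables torchrun / the launcher normally sets
    env = {k: os.environ.get(k)
           for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR")}
    host = socket.gethostname()
    print(f"{host} env: {env}", flush=True)

    if not dist.is_initialized():
        dist.init_process_group(backend="nccl")   # "gloo" for a CPU-only test

    rank, world = dist.get_rank(), dist.get_world_size()
    print(f"{host}: rank {rank} of {world}", flush=True)

    # gather everyone's rank everywhere; duplicates would mean the
    # "all workers think they are rank 0" situation Josh describes
    ranks = [None] * world
    dist.all_gather_object(ranks, rank)
    assert len(set(ranks)) == world, f"duplicate ranks reported: {ranks}"

Calling that right before each checkpoint (and putting a dist.barrier()
after the save) would at least tell us whether the ranks are still sane
at the point where the job wedges.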

On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcatana at gmail.com> wrote:
>
> I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write.  Eventually it just times out.
>
>
> On Fri, Jul 12, 2024, 1:47 PM plegresl at gmail.com <plegresl at gmail.com> wrote:
>>
>> I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?
>>
>> > Date: Thu, 11 Jul 2024 12:25:05 -0400
>> > From: Michael DiDomenico <mdidomenico4 at gmail.com>
>> > To: Beowulf Mailing List <Beowulf at beowulf.org>
>> > Subject: [Beowulf] lustre / pytorch
>> > Message-ID:
>> >       <CABOsP2P7L4J8kJQRqxC9U_yJ3MLjhj68Z6fy17O5+E0WeEyUww at mail.gmail.com>
>> > Content-Type: text/plain; charset="UTF-8"
>> >
>> > I have a strange problem, but honestly I'm not sure where the issue
>> > is.  We have users running LLM models through PyTorch.  Part of the
>> > process saves off checkpoints at periodic intervals.  When the
>> > checkpoint files are being written, we can see in the logs that
>> > PyTorch is writing the save files from each of the processes to Lustre.
>> >
>> > It chugs along for a little bit, but then comes to a grinding halt.
>> > No error from PyTorch is logged and no errors can be found on the
>> > Lustre clients or servers.  The problem is also not transient; it
>> > happens every time the process runs.
>> >
>> > The weird part is that if we switch the output directory from Lustre
>> > to NFS (NetApp-backed), the PyTorch run works perfectly fine.
>> >
>> > Has anyone seen anything like this?  Any suggestions on
>> > troubleshooting the issue?
>> >
>> > Given that we have a 10x performance difference between NetApp and
>> > Lustre, I'm pretty keen on getting this fixed.
>>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
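
Re the py-spy/gdb suggestion quoted above: if py-spy turns out not to be
available on the compute nodes, a stopgap would be wiring faulthandler
into the training script so we can dump the Python stacks of the hung
workers on demand (again just a sketch):

# stack_dump.py -- drop near the top of the training script; then
# "kill -USR1 <pid>" on a hung worker prints every thread's Python stack
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

That only shows the Python frames, so if the stacks end inside
torch.save() or a low-level write, gdb (or /proc/<pid>/stack on the
client) would be the next stop to see where the thread is parked in the
Lustre client code.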

