[Beowulf] lustre / pytorch

Josh Catana jcatana at gmail.com
Sat Jul 13 01:50:56 UTC 2024


I've seen this issue when running distributed jobs where RANK isn't set.
Every worker then thinks it is rank 0, and none of them can get a file lock
to write.  Eventually it just times out.
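
A quick way to rule this out is to fail loudly at startup instead of
silently defaulting to rank 0. The sketch below assumes the job is
launched torchrun-style, with RANK and WORLD_SIZE exported in the
environment. The helper names resolve_rank and checkpoint_path are
made up for illustration, and the per-rank file naming is just one way
to keep workers off each other's files, not necessarily what your
training code does.

```python
import os


def resolve_rank(env=None):
    """Return (rank, world_size) from the environment, or raise if unset.

    Assumes torchrun-style RANK / WORLD_SIZE variables. Raising here is
    deliberate: a silent fallback to rank 0 is what lets every worker
    believe it is the one that should write the checkpoint.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("RANK", "WORLD_SIZE") if name not in env]
    if missing:
        raise RuntimeError("not set in environment: " + ", ".join(missing))
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} outside world size {world_size}")
    return rank, world_size


def checkpoint_path(directory, step, rank):
    """Build a per-rank file name so no two workers target the same file.

    The result would be handed to torch.save() by the caller.
    """
    return os.path.join(directory, f"step{step:08d}-rank{rank:05d}.pt")
```

If resolve_rank() raises on some nodes but not others, the launcher is
not propagating the environment, which would point at the job setup
rather than at Lustre.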

On Fri, Jul 12, 2024, 1:47 PM plegresl at gmail.com <plegresl at gmail.com> wrote:

> I’ve never seen any difficulties with PyTorch saving checkpoint files to
> Lustre. Is it a special file format or just torch.save()? When the
> processes hang, have you tried using something like py-spy and/or gdb to
> get a stack trace of where in the software stack it’s hung?
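
If attaching py-spy or gdb to every rank is awkward, the standard
library's faulthandler module can be armed from inside the job so that
each process dumps its Python stacks once it has been running past a
timeout. This is only a sketch: the function names and the 600 second
default are assumptions, and it shows Python frames only, so a hang
deep inside a native library may still need gdb.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600, stream=sys.stderr):
    """Dump the tracebacks of all threads every timeout_s seconds.

    Call this just before the checkpoint save. `stream` must be a real
    file with a file descriptor, since faulthandler writes to it directly.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)


def disarm_hang_watchdog():
    """Cancel the pending dump once the save has returned."""
    faulthandler.cancel_dump_traceback_later()
```

Wrapping the save call between arm_hang_watchdog() and
disarm_hang_watchdog() means a healthy save prints nothing, while a
stuck one leaves a traceback in each rank's stderr.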
>
> > Date: Thu, 11 Jul 2024 12:25:05 -0400
> > From: Michael DiDomenico <mdidomenico4 at gmail.com>
> > To: Beowulf Mailing List <Beowulf at beowulf.org>
> > Subject: [Beowulf] lustre / pytorch
> >
> > i have a strange problem, but honestly i'm not sure where the issue
> > is.  we have users running LLM models through pytorch.  part of the
> > process saves off checkpoints at periodic intervals.  when the
> > checkpoint files are being written, we can see in the logs pytorch
> > writing out the save files from each of the processes to lustre.
> >
> > it chugs along for a little bit, but then comes to a grinding halt.
> > no error from pytorch is logged and no errors can be found on the
> > lustre clients or servers.  the problem is also not transient; it
> > happens every time the process runs.
> >
> > the weird part is, if we switch the output directory from lustre to
> > nfs (netapp backed), the pytorch run works perfectly fine
> >
> > has anyone seen anything like this?  any suggestions on
> > troubleshooting the issue?
> >
> > given that we have a 10x performance difference between netapp and
> > lustre, i'm pretty keen on getting this fixed