[Beowulf] lustre / pytorch

Michael DiDomenico mdidomenico4 at gmail.com
Mon Jul 15 16:12:31 UTC 2024


unfortunately, so far the lustre system isn't producing any errors on
the mgs/mds/ost or the client.  i'm going to work with the dev this
afternoon and see if we can pull a lustre debug trace from the systems
and see if that turns up anything.

also unfortunate is that we need 8 nodes to trigger the error.  the
model won't fit on anything smaller :(

On Mon, Jul 15, 2024 at 11:19 AM Ellis Wilson <ellis at ellisv3.com> wrote:
>
> Looks like you cross-posted on the Lustre list, which is a great spot to
> ask.  The things I would usually do here are:
>
> 1. If I can manage to reproduce this with a single process from a
> single client, then I strace with numerous flags and see what syscall or
> similar it's stuck on when it comes to a halt.  Alternatively you can
> attach to a seemingly hung process and catch the last syscall it issued
> and is waiting on (or issuing and timing out on), though that hasn't
> always worked for me.  If you can only repro this with lots of clients
> and processes, attaching to a couple and waiting until they time out
> should give you a decent idea of what they are timing out on (a minimal
> repro sketch is below, after point 2).
>
> 2. On Lustre, if you have access to the MGS node you should be able to
> register a changelog user and enable a sufficiently broad changelog mask
> to capture all calls to the system.  Then trigger your problematic
> workload, and finally read the changelogs out and look for what the hung
> client(s) were doing around the time the hang occurred (a sketch of
> reading them back is below).  This is expensive, and you'll need to make
> sure you disable the changelogs afterwards or you'll eventually drive
> your MDS out of space.
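>
> For the first one, a minimal single-process repro along these lines is
> usually enough to tell whether one writer can trigger the stall on the
> Lustre mount.  Just a sketch -- the path and sizes are made up, point it
> at the real checkpoint directory:
>
>   # single-process checkpoint write loop; /lustre/scratch/ckpt_test is
>   # a made-up path, substitute the real output directory
>   import os
>   import time
>   import torch
>
>   out_dir = "/lustre/scratch/ckpt_test"
>   os.makedirs(out_dir, exist_ok=True)
>
>   # roughly 256 MB of float32 tensors, standing in for a model state dict
>   state = {f"layer{i}": torch.randn(1024, 1024) for i in range(64)}
>
>   for step in range(10):
>       path = os.path.join(out_dir, f"ckpt_{step}.pt")
>       t0 = time.time()
>       torch.save(state, path)   # same call the training job uses
>       print(f"step {step}: wrote {path} in {time.time() - t0:.1f}s", flush=True)
>
> If that stalls, running it under strace (or attaching to it) narrows
> things down quickly; if it doesn't, the problem probably needs the
> multi-rank case.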
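>
> For the second one, once the changelog user is registered, pulling the
> records back from a client can be scripted.  A rough sketch -- the MDT
> name and the filename filter are placeholders for whatever your
> filesystem actually uses, and it assumes lfs is on the client and a
> changelog user already exists:
>
>   # read Lustre changelog records and keep the ones that mention the
>   # checkpoint files; "lustre-MDT0000" is a placeholder MDT name
>   import subprocess
>
>   MDT = "lustre-MDT0000"   # placeholder -- use the real fsname-MDTxxxx
>   out = subprocess.run(["lfs", "changelog", MDT],
>                        capture_output=True, text=True, check=True)
>
>   for line in out.stdout.splitlines():
>       # records look roughly like: recno type time date flags t=FID ... name
>       if "ckpt" in line:   # crude filter for the checkpoint file names
>           print(line)
>
> Remember to clear and deregister the changelog user afterwards, for the
> space reasons above.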
>
> Best,
>
> ellis
>
> On 7/15/24 11:01, Michael DiDomenico wrote:
> > that's interesting on two counts, one being that file locks are in play
> > at all.  i've tried mounting the clients with both flock and noflock,
> > but neither seemed to make a difference (i presumed file locks weren't
> > being taken).
> >
> > is there something we should put in the code to ensure all the RANKs
> > are established at the beginning, or maybe throughout the run (perhaps
> > something odd happens later on)?
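> >
> > something like this at the top of the training script, maybe?  (just a
> > sketch -- the env var names are the standard ones the torchrun-style
> > launchers set)
> >
> >   # fail fast if the launcher didn't hand this process a distinct rank
> >   import os
> >   import torch.distributed as dist
> >
> >   for var in ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
> >       assert var in os.environ, f"{var} not set -- ranks not established"
> >
> >   dist.init_process_group(backend="nccl")  # reads the env vars above
> >   print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up",
> >         flush=True)
> >   dist.barrier()  # every rank checks in before training starts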
> >
> > On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <jcatana at gmail.com> wrote:
> >>
> >> I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write.  Eventually it just times out.
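> >>
> >> The usual guard, if a single shared checkpoint file is being written, is
> >> something along these lines (just a sketch, assuming the process group is
> >> already initialized):
> >>
> >>   # only rank 0 writes the shared checkpoint; the barrier keeps the
> >>   # other ranks from racing ahead and touching a half-written file
> >>   import torch
> >>   import torch.distributed as dist
> >>
> >>   def save_checkpoint(state, path):
> >>       if dist.get_rank() == 0:
> >>           torch.save(state, path)
> >>       dist.barrier()
> >>
> >> If every rank writes its own shard instead, giving each one a distinct
> >> filename (ckpt_rank{N}.pt or similar) avoids the lock contention entirely.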
> >>
> >>
> >> On Fri, Jul 12, 2024, 1:47 PM plegresl at gmail.com <plegresl at gmail.com> wrote:
> >>>
> >>> I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?
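> >>>
> >>> If attaching py-spy or gdb on those nodes is awkward, the standard
> >>> library's faulthandler can do something similar from inside the
> >>> process -- a small sketch (the signal choice is arbitrary):
> >>>
> >>>   # dump every thread's Python stack when the process receives SIGUSR1,
> >>>   # so a hung worker can be inspected with `kill -USR1 <pid>`
> >>>   import faulthandler
> >>>   import signal
> >>>
> >>>   faulthandler.register(signal.SIGUSR1, all_threads=True)
> >>>
> >>> Put that near the top of the training script and the stacks land on
> >>> stderr of whichever rank you signal.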
> >>>
> >>>> Date: Thu, 11 Jul 2024 12:25:05 -0400
> >>>> From: Michael DiDomenico <mdidomenico4 at gmail.com>
> >>>> To: Beowulf Mailing List <Beowulf at beowulf.org>
> >>>> Subject: [Beowulf] lustre / pytorch
> >>>>
> >>>> i have a strange problem, but honestly i'm not sure where the issue
> >>>> is.  we have users running LLM models through pytorch.  part of the
> >>>> process saves off checkpoints at periodic intervals.  when the
> >>>> checkpoint files are being written, we can see in the logs pytorch
> >>>> writing out the save files from each of the processes to lustre.
> >>>>
> >>>> it chugs along for a little bit, but then comes to a grinding halt.
> >>>> no error from pytorch is logged and no errors can be found on the
> >>>> lustre clients or servers.  the problem is also not transient; it
> >>>> happens every time the process runs.
> >>>>
> >>>> the weird part is, if we switch the output directory from lustre to
> >>>> nfs (netapp backed), the pytorch run works perfectly fine
> >>>>
> >>>> has anyone seen anything like this?  any suggestions on
> >>>> troubleshooting the issue?
> >>>>
> >>>> given that we have a 10x performance difference between netapp and
> >>>> lustre, i'm pretty keen on getting this fixed.

