<div dir="auto"><p dir="ltr">It depends on the distributed training framework and/or the job submission platform. <br>

Usually the scheduler or submission platform (Slurm, LSF, PBS, assorted k8s components) takes care of assigning each process its rank through environment variables.<br>

The distributed framework (torch.distributed, DeepSpeed, NeMo, Horovod, etc.) should be checking for those variables.</p>
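<p dir="ltr">As a rough sketch of what "checking for that" could look like at the top of a training script: look through a list of rank variables and fail loudly if none is set, rather than quietly assuming rank 0. The function name and the lookup order are my own choices for illustration; the variable names are the ones commonly set by torchrun, Slurm, Open MPI and PMI-based launchers, so adjust the list for your platform.</p>

```python
import os

# Candidate variables, in lookup order. Commonly set by torchrun (RANK),
# Slurm (SLURM_PROCID), Open MPI (OMPI_COMM_WORLD_RANK) and PMI-based
# launchers (PMI_RANK). The order here is an assumption, not a standard.
RANK_VARS = ("RANK", "SLURM_PROCID", "OMPI_COMM_WORLD_RANK", "PMI_RANK")


def resolve_rank(env=None):
    """Return (variable_name, rank) for the first rank variable that is set.

    Raises RuntimeError instead of silently falling back to rank 0.
    """
    env = os.environ if env is None else env
    for name in RANK_VARS:
        value = env.get(name)
        if value is not None and value.strip() != "":
            return name, int(value)
    raise RuntimeError(
        "no rank variable set (looked for: " + ", ".join(RANK_VARS) + ")"
    )
```

<p dir="ltr">Logging the returned variable name along with the rank from every process makes it obvious when the platform and the framework disagree about which variable carries the rank.</p>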

<p dir="ltr">I've seen issues where the framework checks for MPI_WORLD_RANK but the platform sets RANK, so the framework concludes the rank was never set and defaults every process to 0. <br>

Or sometimes it's just a bad framework config by the data scientist.<br><br></p><p dir="ltr">

Either way, every worker believes it is rank 0, which causes a race condition on the lock when you write the first checkpoint. The end result is just a very long timeout at that point.</p></div>
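<p dir="ltr">One way to catch this before the first checkpoint, sketched under the assumption that the launcher exports a world-size variable: compare what the framework reports against what the launcher advertised and abort if the framework sees fewer processes than were launched. The helper names are hypothetical; the variable names are the ones commonly set by torchrun, Slurm, Open MPI and PMI-based launchers.</p>

```python
import os

# World-size variables commonly set by torchrun, Slurm, Open MPI and PMI launchers.
SIZE_VARS = ("WORLD_SIZE", "SLURM_NTASKS", "OMPI_COMM_WORLD_SIZE", "PMI_SIZE")


def launcher_world_size(env=None):
    """Largest world size advertised by the launcher, or None if none is set."""
    env = os.environ if env is None else env
    sizes = [int(env[name]) for name in SIZE_VARS if env.get(name, "").strip()]
    return max(sizes) if sizes else None


def check_world(framework_rank, framework_world_size, env=None):
    """Raise RuntimeError if the framework sees fewer processes than the launcher started."""
    expected = launcher_world_size(env)
    if expected is not None and framework_world_size < expected:
        raise RuntimeError(
            f"framework world size {framework_world_size} is smaller than "
            f"the launcher's {expected}; ranks are probably not being picked up"
        )
    if not 0 <= framework_rank < framework_world_size:
        raise RuntimeError(
            f"rank {framework_rank} is outside 0..{framework_world_size - 1}"
        )
```

<p dir="ltr">With PyTorch, the two arguments could come from torch.distributed.get_rank() and torch.distributed.get_world_size() after the process group is initialised. A job launched with 8 tasks where every process reports a world size of 1 then fails immediately instead of hanging at the first save.</p>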

<br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Jul 15, 2024, 11:01 AM Michael DiDomenico <<a href="mailto:mdidomenico4@gmail.com" target="_blank" rel="noreferrer">mdidomenico4@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">that's interesting on two counts, one that file locks are in play.<br>

i've tried with both flock and noflock on the clients, but neither<br>

seemed to make a difference, (i presumed file locks weren't taking<br>

place)<br>

<br>

is there something we should put in the code to ensure all the RANK's<br>

are established at the beginning or maybe throughout the run (perhaps<br>

something odd happens later on)<br>

<br>

On Sat, Jul 13, 2024 at 3:47 AM Josh Catana <<a href="mailto:jcatana@gmail.com" rel="noreferrer noreferrer" target="_blank">jcatana@gmail.com</a>> wrote:<br>

><br>

> I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write.  Eventually it just times out.<br>

><br>

><br>

> On Fri, Jul 12, 2024, 1:47 PM <a href="mailto:plegresl@gmail.com" rel="noreferrer noreferrer" target="_blank">plegresl@gmail.com</a> <<a href="mailto:plegresl@gmail.com" rel="noreferrer noreferrer" target="_blank">plegresl@gmail.com</a>> wrote:<br>

>><br>

>> I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?<br>

>><br>

>> > Date: Thu, 11 Jul 2024 12:25:05 -0400<br>

>> > From: Michael DiDomenico <<a href="mailto:mdidomenico4@gmail.com" rel="noreferrer noreferrer" target="_blank">mdidomenico4@gmail.com</a>><br>

>> > To: Beowulf Mailing List <<a href="mailto:Beowulf@beowulf.org" rel="noreferrer noreferrer" target="_blank">Beowulf@beowulf.org</a>><br>

>> > Subject: [Beowulf] lustre / pytorch<br>

>> > Message-ID:<br>

>> >       <<a href="mailto:CABOsP2P7L4J8kJQRqxC9U_yJ3MLjhj68Z6fy17O5%2BE0WeEyUww@mail.gmail.com" rel="noreferrer noreferrer" target="_blank">CABOsP2P7L4J8kJQRqxC9U_yJ3MLjhj68Z6fy17O5+E0WeEyUww@mail.gmail.com</a>><br>

>> > Content-Type: text/plain; charset="UTF-8"<br>

>> ><br>

>> > i have a strange problem, but honestly i'm not sure where the issue<br>

>> > is.  we have users running LLM models through pytorch.  part of the<br>

>> > process saves off checkpoints at periodic intervals.  when the<br>

>> > checkpoint files are being written we can see in the logs the pytorch<br>

>> > writing out the save files from each of the processes to lustre.<br>

>> ><br>

>> > it chugs along for a little bit, but then comes to a grinding halt.<br>

>> > no error from pytorch is logged and no errors can be found on the<br>

>> > lustre clients or servers.  the problem is also not transient, it<br>

>> > happens every time the process runs<br>

>> ><br>

>> > the weird part is, if we switch the output directory from lustre to<br>

>> > nfs (netapp backed), the pytorch run works perfectly fine<br>

>> ><br>

>> > has anyone seen anything like this?  any suggestions on trouble<br>

>> > shooting the issue?<br>

>> ><br>

>> > given that we have a 10x performance difference between netapp and<br>

>> > lustre, i'm pretty keen on getting this fixed<br>

>><br>

>> _______________________________________________<br>

>> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" rel="noreferrer noreferrer" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

>> To change your subscription (digest mode or unsubscribe) visit <a href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf" rel="noreferrer noreferrer noreferrer" target="_blank">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a><br>

><br>


</blockquote></div>