<p dir="ltr">I've seen this issue when running distributed and RANK isn't established. All workers think they are rank 0 and none of them can get a file lock to write. Eventually it just times out.</p>
<br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 12, 2024, 1:47 PM <a href="mailto:plegresl@gmail.com">plegresl@gmail.com</a> <<a href="mailto:plegresl@gmail.com">plegresl@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I’ve never seen any difficulties with PyTorch saving checkpoint files to Lustre. Is it a special file format or just torch.save()? When the processes hang, have you tried using something like py-spy and/or gdb to get a stack trace of where in the software stack it’s hung?<br>
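For example, something along these lines is usually enough to see where it's stuck (the PID is just a placeholder; py-spy and gdb need to attach as the same user or with ptrace privileges):<br>
<pre>
# Python-level stack of a hung rank, without stopping the process
py-spy dump --pid 12345

# Include native frames too (useful if it's blocked in the Lustre client or libc)
py-spy dump --pid 12345 --native

# gdb alternative: attach and dump backtraces for every thread
gdb -p 12345 -batch -ex "thread apply all bt"
</pre>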
<br>
> Date: Thu, 11 Jul 2024 12:25:05 -0400<br>
> From: Michael DiDomenico <<a href="mailto:mdidomenico4@gmail.com" target="_blank" rel="noreferrer">mdidomenico4@gmail.com</a>><br>
> To: Beowulf Mailing List <<a href="mailto:Beowulf@beowulf.org" target="_blank" rel="noreferrer">Beowulf@beowulf.org</a>><br>
> Subject: [Beowulf] lustre / pytorch<br>
> <br>
> I have a strange problem, but honestly I'm not sure where the issue<br>
> is. We have users running LLM models through PyTorch, and part of the<br>
> process saves off checkpoints at periodic intervals. When the<br>
> checkpoint files are being written, we can see in the logs that<br>
> PyTorch is writing out the save files from each process to Lustre.<br>
> <br>
> It chugs along for a little bit, but then comes to a grinding halt.<br>
> No error from PyTorch is logged, and no errors can be found on the<br>
> Lustre clients or servers. The problem is also not transient; it<br>
> happens every time the process runs.<br>
> <br>
> The weird part is that if we switch the output directory from Lustre<br>
> to NFS (NetApp backed), the PyTorch run works perfectly fine.<br>
> <br>
> Has anyone seen anything like this? Any suggestions on<br>
> troubleshooting the issue?<br>
> <br>
> Given that we have a 10x performance difference between NetApp and<br>
> Lustre, I'm pretty keen on getting this fixed.<br>
<br>
</blockquote></div>