[Beowulf] lustre / pytorch

Michael DiDomenico mdidomenico4 at gmail.com
Thu Jul 11 16:25:05 UTC 2024


i have a strange problem, but honestly i'm not sure where the issue
is.  we have users running LLMs through pytorch.  part of the
process saves checkpoints at periodic intervals.  while the
checkpoint files are being written, we can see in the logs pytorch
writing out the save files from each of the processes to lustre.

it chugs along for a little bit, but then comes to a grinding halt.
no error from pytorch is logged and no errors can be found on the
lustre clients or servers.  the problem is also not transient; it
happens every time the process runs.

the weird part is, if we switch the output directory from lustre to
nfs (netapp backed), the pytorch run works perfectly fine

has anyone seen anything like this?  any suggestions on
troubleshooting the issue?
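
one thing we could try to narrow it down, as a rough sketch only: wrap
a write in a watchdog so that a stalled save dumps the python stack of
every thread instead of hanging silently.  this uses only the python
standard library (faulthandler); the function name timed_write, its
parameters, and the use of a plain file write as a stand-in for the
real pytorch checkpoint save are all assumptions for illustration, not
what the users' code actually does.

```python
import faulthandler
import os
import sys
import time


def timed_write(path, payload, stall_after=60.0, log_file=None):
    """Write payload to path, fsync it, and return the elapsed seconds.

    Assumption: a plain binary write stands in for the real checkpoint
    save.  If the write is still running after stall_after seconds,
    faulthandler dumps the traceback of every thread to log_file
    (stderr by default) so we can see where the process is blocked.
    """
    out = log_file if log_file is not None else sys.stderr
    # Arm the watchdog; exit=False means the process keeps running
    # after the tracebacks have been dumped.
    faulthandler.dump_traceback_later(stall_after, exit=False, file=out)
    start = time.monotonic()
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            # Force the data out of the page cache to the filesystem.
            os.fsync(f.fileno())
    finally:
        # Disarm the watchdog whether the write finished or raised.
        faulthandler.cancel_dump_traceback_later()
    return time.monotonic() - start
```

running this once against a directory on lustre and once against the
nfs mount, with a payload about the size of one checkpoint shard,
should at least show whether a plain write stalls too or whether the
hang is specific to how pytorch writes its files.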

given that lustre gives us roughly 10x the performance of the netapp,
i'm pretty keen on getting this fixed

