[Beowulf] help for metadata-intensive jobs (imagenet)
i.n.kozin at googlemail.com
Fri Jun 28 11:57:31 PDT 2019
Converting the files to TFRecords or similar would be one obvious approach
if you are concerned about metadata. But then I'd understand why some
people would not want that (size, augmentation process). I assume you are
doing the training in a distributed fashion using MPI via Horovod or
similar, and it might be tempting to partition the files across the nodes.
However, doing so introduces a bias into minibatches (and custom
preprocessing). If you partition carefully by mapping classes to nodes it
may work, but I also understand why some wouldn't be totally happy with
that. I've trained Keras/TF/Horovod models on ImageNet using up to 6 nodes,
each with four P100/V100 GPUs, and it worked reasonably well. As the training
still took a few days, copying to local NVMe disks was a good option.
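
As a rough sketch of the class-to-node mapping mentioned above: the
function name, the round-robin assignment, and the file names are
assumptions made up for illustration, not taken from a real training
script.

```python
from collections import defaultdict


def partition_by_class(samples, num_nodes):
    """Assign whole classes to nodes round-robin (hypothetical helper).

    samples: iterable of (path, class_label) pairs.
    Returns a list of num_nodes lists of paths, one list per node.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    shards = [[] for _ in range(num_nodes)]
    # Sort the labels so every node derives the same mapping on its own.
    for i, label in enumerate(sorted(by_class)):
        shards[i % num_nodes].extend(by_class[label])
    return shards


# Made-up file names, for illustration only.
samples = [
    ("n01/a.jpg", "n01"), ("n01/b.jpg", "n01"),
    ("n02/c.jpg", "n02"),
    ("n03/d.jpg", "n03"), ("n03/e.jpg", "n03"),
    ("n04/f.jpg", "n04"),
]
shards = partition_by_class(samples, num_nodes=2)
for node, paths in enumerate(shards):
    print(node, paths)
```

Each node then only has to stage its own shard, but it also only ever
sees its own classes, which is the minibatch bias I mean. Shard sizes
can come out uneven as well when classes differ in size.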
On Fri, 28 Jun 2019, 18:47 Mark Hahn, <hahn at mcmaster.ca> wrote:
> Hi all,
> I wonder if anyone has comments on ways to avoid metadata bottlenecks
> for certain kinds of small-io-intensive jobs. For instance, ML on
> imagenet, which seems to be a massive collection of trivial-sized files.
> A good answer is "beef up your MD server, since it helps everyone".
> That's a bit naive, though (no money-trees here.)
> How about things like putting the dataset into squashfs or some other
> image that can be loop-mounted on demand? sqlite? perhaps even a format
> that can simply be mmaped as a whole?
> personally, I tend to dislike the approach of having a job stage tons of
> stuff onto node storage (when it exists) simply because that guarantees a
> waste of cpu/gpu/memory resources for however long the stagein takes...
> thanks, mark hahn.
> operator may differ from spokesperson. hahn at mcmaster.ca