[Beowulf] help for metadata-intensive jobs (imagenet)

Fri Jun 28 11:57:31 PDT 2019

Converting the files to TFRecords or a similar format would be one obvious
approach if you are concerned about metadata. But then I'd understand why
some people would not want that (size, augmentation process). I assume you
are doing the training in a distributed fashion using MPI via Horovod or
similar, and it might be tempting to partition the files across the nodes.
However, doing so introduces a bias into minibatches (and custom
preprocessing). If you partition carefully by mapping classes to nodes it
may work, but I also understand why some wouldn't be totally happy with
that. I've trained Keras/TF/Horovod models on ImageNet using up to 6 nodes,
each with four P100/V100 GPUs, and it worked reasonably well. As the training
still took a few days, copying to local NVMe disks was a good option.
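
To illustrate the partitioning point: a rough sketch (not what I actually
ran) of one way to split files across ranks without pinning the same files
to the same node forever. Every rank computes the same per-epoch shuffle and
takes a strided slice of it. The function name shard_for_epoch and the seed
handling are made up for illustration; with Horovod, rank and world_size
would come from hvd.rank() and hvd.size(). Note that a node-local NVMe copy
then has to hold the whole dataset, since any rank may read any file.

```python
import random


def shard_for_epoch(files, rank, world_size, epoch, seed=0):
    """Return the files one rank should read during one epoch.

    Hypothetical helper: every rank builds the same permutation for a
    given epoch, so the shards are disjoint without any communication.
    """
    order = sorted(files)  # identical starting order on every rank
    # Same seed on every rank, different for each epoch.
    random.Random(seed + epoch).shuffle(order)
    # Drop the remainder so every rank sees the same number of files.
    usable = len(order) - len(order) % world_size
    return order[rank:usable:world_size]


if __name__ == "__main__":
    # Made-up file names, three simulated ranks, two epochs.
    files = [f"img_{i:04d}.JPEG" for i in range(10)]
    for epoch in range(2):
        for rank in range(3):
            print(epoch, rank, shard_for_epoch(files, rank, 3, epoch))
```

Because the permutation changes with the epoch number, each rank sees a
different subset every epoch, so no node is tied to a fixed set of classes.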
HTH

On Fri, 28 Jun 2019, 18:47 Mark Hahn, <hahn at mcmaster.ca> wrote:

> Hi all,
> I wonder if anyone has comments on ways to avoid metadata bottlenecks
> for certain kinds of small-io-intensive jobs.  For instance, ML on
> imagenet, which seems to be a massive collection of trivial-sized files.
>
> A good answer is "beef up your MD server, since it helps everyone".
> That's a bit naive, though (no money-trees here.)
>
> How about things like putting the dataset into squashfs or some other
> image that can be loop-mounted on demand?  sqlite?  perhaps even a format
> that can simply be mmaped as a whole?
>
> personally, I tend to dislike the approach of having a job stage tons of
> stuff onto node storage (when it exists) simply because that guarantees a
> waste of cpu/gpu/memory resources for however long the stagein takes...
>
> thanks, mark hahn.
> --
> operator may differ from spokesperson.              hahn at mcmaster.ca