<div dir="ltr">Tom, as Reuti says, let's have a look at the nature of these files.<div>What are they, and are analysis jobs really revisiting them again and again?</div><div><br></div><div>This is a marvellous tool for analysing filesystem usage:</div>
<div><a href="http://www.chiark.greenend.org.uk/~sgtatham/agedu/">http://www.chiark.greenend.org.uk/~sgtatham/agedu/</a><br></div><div><br></div><div>I have used it a lot in the past on the scratch storage of our clusters to highlight data which hadn't been used in ages.</div>
<div><br></div><div>I'm not sure how long agedu will take to index a large Lustre filesystem like yours, but it would be well worth a try.</div><div><br></div><div>Agedu doesn't work on DMF filesystems (as it uses a stat on the file, and migrated files would appear to be very small).</div>
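<div><br></div><div>To give a feel for the kind of report I mean, here is a minimal Python sketch that walks a tree and lists files whose last access time is older than a threshold. It is not agedu's implementation, just an illustration of the idea; the function name find_stale_files and its parameters are my own, and it assumes the filesystem actually records access times (a noatime mount would make the numbers meaningless).</div>

```python
import os
import stat
import time
from pathlib import Path


def find_stale_files(root, min_idle_days, now=None):
    """Return (path, size_bytes, idle_days) for regular files whose last
    access time is at least min_idle_days old, most idle first.

    Hypothetical helper for illustration; not part of agedu.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices
            idle_days = (now - st.st_atime) / 86400
            if idle_days >= min_idle_days:
                stale.append((str(path), st.st_size, idle_days))
    # Longest-idle files first
    stale.sort(key=lambda item: item[2], reverse=True)
    return stale
```

<div>For example, find_stale_files("/scratch", 180) would list everything under a (hypothetical) /scratch that has not been read for roughly six months. On a filesystem your size, every stat costs a metadata operation, so I would try it on one project directory first.</div>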
<div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 12 June 2014 12:12, John Hearns <span dir="ltr">&lt;<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Tom, <div>I agree with you regarding small files.</div><div>In my case, I manage a DMF (SGI Data Migration Facility) setup.</div>
<div>I was concerned about the number of small files we were storing, both in terms of the size of the database files and of writing small files to tape.</div>
<div>SGI engineers reassured me that the system will happily cope with millions of files, and does so on many sites.</div><div>DMF also waits until a large 'chunk' is ready to be written to tape, i.e. small writes are queued up.</div>
<div><br></div><div>However, when watching the number of files being pushed to the tape tier one day I noticed something like 10 000 files or more from one user.</div><div>Cue the application of a LART. </div><div>Seriously though - I did have a word and he agreed to zip up all the small PNG files his project was generating.</div>
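<div><br></div><div>Zipping up a directory of small files is easy to script. The following is a minimal Python sketch using the standard zipfile module; the function name zip_directory and the paths in the example are hypothetical, and it assumes the archive is written outside the directory being packed.</div>

```python
import zipfile
from pathlib import Path


def zip_directory(src_dir, archive_path):
    """Pack every regular file under src_dir into one zip archive and
    return the number of files stored.

    Hypothetical helper for illustration. archive_path should live
    outside src_dir, otherwise the archive would try to include itself.
    """
    src = Path(src_dir)
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so that an unzip
                # recreates the same directory layout.
                zf.write(path, arcname=path.relative_to(src).as_posix())
                count += 1
    return count
```

<div>For example, zip_directory("run_0042/frames", "run_0042/frames.zip") turns thousands of PNG files into a single file for the tape tier. PNG data is already compressed, so the gain is in the file count rather than in the bytes.</div>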
<div><br></div><div>I have a general policy here that when lots of small files are generated, the directory is zipped up and the zip file is stored.</div><div>We have codes which generate lots of zip files which are stitched together into movies, and we also store wind tunnel data which is again</div>
<div>lots of PNG files. It is unlikely that anyone would ever want the raw data files again, but if they should do then an unzip is easy.</div><div class=""><div><br></div><div><br></div><div><span style="font-family:arial,sans-serif;font-size:13px">> Do you distinguish and segregate them (and/or the people that use them) on special</span><br style="font-family:arial,sans-serif;font-size:13px">
<span style="font-family:arial,sans-serif;font-size:13px">> hardware/filesystems?</span><br></div></div><div>Suggest you invest in a LART. <a href="http://dictionary.reference.com/browse/lart" target="_blank">http://dictionary.reference.com/browse/lart</a></div>
<div><br></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On 12 June 2014 11:43, Reuti <span dir="ltr">&lt;<a href="mailto:reuti@staff.uni-marburg.de" target="_blank">reuti@staff.uni-marburg.de</a>&gt;</span> wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>
<br>
On 11.06.2014 at 21:03, Tom Harvill wrote:<br>
<div><br>
> This is my first time posting to this list, thanks in advance for any time you spend<br>
> replying.<br>
><br>
> We've found that a large majority of our files (~40MM of ~50MM) are less than 10KB.<br>
> We believe our filesystem (lustre) is bottlenecked with IOPs and locking related to<br>
> jobs running against these files. We have ~700TB usable storage with ~500TB consumed,<br>
> almost all consumption is by a relatively small number of very very large files.<br>
<br>
</div>What data is represented in 10KB: binary or ASCII data - would it work to put it in a database instead of all these single files? How do you access the files: by some kind of index, name, directory...?<br>
<span><font color="#888888"><br>
-- Reuti<br>
</font></span><div><div><br>
<br>
> I want to ask this general question: how does your shop deal with the general problem of<br>
> small files in filesystems on (beowulf) compute clusters? Specifically, files that users expect<br>
> to actively use for read and write operations for their research.<br>
><br>
> Do you distinguish and segregate them (and/or the people that use them) on special<br>
> hardware/filesystems?<br>
><br>
> Thanks!<br>
> Tom<br>
><br>
> Tom Harvill<br>
> Holland Computing Center<br>
> University of Nebraska<br>
> _______________________________________________<br>
> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
> To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
<br>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>