[Beowulf] Small files

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Fri Jun 13 11:46:21 PDT 2014

On 6/13/14, 7:03 AM, "Ellis H. Wilson III" <ellis at cse.psu.edu> wrote:

>On 06/13/2014 09:31 AM, Joe Landman wrote:
>> On 06/13/2014 09:17 AM, Skylar Thompson wrote:
>>> We've recently implemented a quota of 1 million files per 1TB of
>>> filesystem space. And yes, we had to clean up a number of groups' and
>>> individuals' spaces before implementing that. There seems to be a trend
>>> in the bioinformatics community for using the filesystem as a database.
>> I wasn't going to say anything about this, but, yes, there are some
>> significant abuses of file systems going on in this community.  But this
>> is nothing new, sadly ...  I've seen this since the late 90's.
>I think we're all probably too close to the tool in question (HPC
>storage).  Ultimately this is just a hammer for scientists and other
>non-CS/IT types, so of course they are going to scoff when we tell them
>they are holding the hammer such that it hits sideways.  "Who's to tell
>me how to hold the hammer?!  This side has more metallic surface area
>anyhow, making it easier to hit the nail this way!"
>So you can either:
>a) Fix it transparently with automatic policies/FS's in the back-end.
>(I know of at least one FS that packs small files with metadata
>transparently on SSDs to expedite small file IOPS, but message me
>off-list for that as I start work for that shop soon and don't want to
>so blatantly advertise).  There are limits to how much these

Let's not let "concern for efficiency" get in the way of "users solving
problems".  I suspect that for a LOT of problems, buying more/faster
hardware is more cost effective than changing how the
scientist/engineer/user works.

Sure, there are HPC applications which are run repeatedly and for which
performance is very important (numerical weather simulations, for
example).

If it's that big a deal, why not make it transparent?  Ellis gave an
example of a system that "blocks" small transactions into better ones
transparently.  That is the way it should be: the user doesn't care how
it happens.
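
Purely as an illustration of that kind of "blocking" (a minimal sketch,
not any real filesystem's design; pack_dir and read_packed are made-up
names): pack many small files into one large container and keep an
offset index, so the storage sees one big sequential object while the
user still thinks in terms of individual files.

import json
import os

def pack_dir(src_dir, pack_path, index_path):
    """Concatenate every small file in src_dir into one container,
    recording (offset, length) per name so lookups stay cheap.
    Assumes src_dir holds only regular files."""
    index = {}
    offset = 0
    with open(pack_path, "wb") as pack:
        for name in sorted(os.listdir(src_dir)):
            with open(os.path.join(src_dir, name), "rb") as f:
                data = f.read()
            pack.write(data)
            index[name] = (offset, len(data))
            offset += len(data)
    with open(index_path, "w") as f:
        json.dump(index, f)
    return index

def read_packed(pack_path, index, name):
    """One seek plus one read, instead of a metadata-heavy open()
    per tiny file."""
    offset, length = index[name]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)
        return pack.read(length)

The user-visible workflow doesn't have to change at all; the aggregation
happens underneath, which is the whole point.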

Do you manually manage memory allocation and caching, or do you let the
OS take care of it?  Heartbleed is a fine example of what happens when
someone tries to "optimize" the performance.

Obviously, if you're a "developer of HPC" as opposed to a "user of HPC",
then understanding what works better or worse, or is more or less
efficient, is important.  But there are a LOT more "users of HPC" who
are NOT "developers of HPC", and that's who should be the focus.

Doesn't this hark back to the perennial assembler vs. high-level
language dispute?  I think you should spend your time making better
optimizing compilers (or better languages for specifying what it is you
want to do) rather than advocating programming in assembler.

