[Beowulf] Small files

Ellis H. Wilson III ellis at cse.psu.edu
Fri Jun 13 07:03:27 PDT 2014

On 06/13/2014 09:31 AM, Joe Landman wrote:
> On 06/13/2014 09:17 AM, Skylar Thompson wrote:
>> We've recently implemented a quota of 1 million files per 1TB of
>> filesystem space. And yes, we had to clean up a number of groups' and
>> individuals' spaces before implementing that. There seems to be a trend
>> in the bioinformatics community for using the filesystem as a database.
> I wasn't going to say anything about this, but, yes, there are some
> significant abuses of file systems going on in this community.  But this
> is nothing new, sadly ...  I've seen this since the late 90's.
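
For concreteness, the quota Skylar describes (1 million files per 1TB)
works out to simple arithmetic.  This is only a rough sketch: the
function names are made up for illustration, and it assumes a decimal
terabyte (10^12 bytes), which the original message does not specify.

```python
# Hypothetical helpers; not any site's actual quota tooling.
FILES_PER_TB = 1_000_000   # ratio quoted in the thread
TB = 10**12                # assumption: decimal terabyte


def file_quota(allocated_bytes):
    """Maximum number of files allowed for a given space allocation."""
    return allocated_bytes * FILES_PER_TB // TB


def over_quota(file_count, allocated_bytes):
    """True if a user or group holds more files than the ratio allows."""
    return file_count > file_quota(allocated_bytes)
```

Put another way, the ratio implies an average file size of at least
1 MB (decimal) across an allocation.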

I think we're all probably too close to the tool in question (HPC 
storage).  Ultimately this is just a hammer for scientists and other 
non-CS/IT types, so of course they are going to scoff when we tell them 
they are holding the hammer such that it hits sideways.  "Who's to tell 
me how to hold the hammer?!  This side has more metallic surface area 
anyhow, making it easier to hit the nail this way!"

So you can either:
a) Fix it transparently with automatic policies or file systems in the
back-end.  (I know of at least one FS that transparently packs small
files with metadata on SSDs to expedite small-file IOPS, but message me
off-list for that, as I start work for that shop soon and don't want to
advertise so blatantly.)  There are limits to how much these policies
and file systems can fix, though.  Past a certain point, bad I/O is
still bad I/O.
b) Enact any number of the "rules" mentioned previously and tell the 
users, "no really, we know a thing or two about these systems, learn how 
to hold the hammer."  You may need to demonstrate on their skull a few 
times for the proper orientation to sink in.
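
The packing idea in (a) can also be approximated by hand in user space.
The sketch below is only an illustration, not how the FS mentioned above
works: it bundles many small records into a single tar archive so the
file system sees one file instead of thousands.  The function names and
record names are hypothetical.

```python
import io
import tarfile


def pack_small_files(records, archive_path):
    """Write many small (name, bytes) records into one tar archive."""
    with tarfile.open(archive_path, "w") as tar:
        for name, payload in records:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            # addfile() copies info.size bytes from the file-like object.
            tar.addfile(info, io.BytesIO(payload))


def read_record(archive_path, name):
    """Read one record back out of the archive by name."""
    with tarfile.open(archive_path, "r") as tar:
        member = tar.extractfile(name)
        return member.read()
```

Lookup in a plain tar archive is a linear scan, so anything that needs
frequent random access is probably better served by a real database or
an indexed container format.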

> I did teach a graduate course on HPC programming at my alma mater about
> a decade ago.  Covered parallelism, optimization, and gave rough rubrics
> for how to write code that made effective use of the machine resources.
> I had face-palm moments when one of the kids told me he didn't know C,
> but could work in C++.  Nowadays we'd be lucky to find anyone whose
> mind was not polluted by Java + other bad-for-hpc things.

What?  How dare you!  I pollute my mind with Java every morning, maybe 
two or three cups full before any real work gets done ;).

Joking aside, making good use of CPU/memory resources in the HPC context
still requires good knowledge of C/Fortran and all of their associated
parallelism libraries, even today.  However, I am not convinced that
making optimal or near-optimal use of remote storage depends at all on
whether you use C, C++, or Java.  You can do very horrible things
I/O-wise in any of those languages, and byte-sized I/O is perhaps even
more likely to happen in a byte-oriented language like C.
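
To make the byte-sized I/O point concrete, here is a minimal sketch (in
Python, to underline that the language is not what matters) that counts
how many read calls it takes to consume a file at a given request size.
The function name is made up for illustration.  The file is opened
unbuffered so that each read() maps to roughly one request to the
underlying file system.

```python
def count_reads(path, chunk_size):
    """Read a whole file in chunk_size pieces; return (calls, bytes)."""
    calls = 0
    total = 0
    # buffering=0 disables Python's own buffer, so every read() below
    # goes straight to the OS instead of being served from memory.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            calls += 1
            total += len(chunk)
    return calls, total
```

On a 1 GB file, chunk_size=1 means on the order of a billion requests,
while chunk_size=1 MiB means about a thousand.  On a remote or parallel
file system each of those requests can carry network and locking
overhead, which is where the damage is done.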



Ph.D. Candidate
Department of Computer Science and Engineering
The Pennsylvania State University
