[Beowulf] Small files
Ellis H. Wilson III
ellis at cse.psu.edu
Fri Jun 13 07:03:27 PDT 2014
On 06/13/2014 09:31 AM, Joe Landman wrote:
> On 06/13/2014 09:17 AM, Skylar Thompson wrote:
>> We've recently implemented a quota of 1 million files per 1TB of
>> filesystem space. And yes, we had to clean up a number of groups' and
>> individuals' spaces before implementing that. There seems to be a trend
>> in the bioinformatics community for using the filesystem as a database.
> I wasn't going to say anything about this, but, yes, there are some
> significant abuses of file systems going on in this community. But this
> is nothing new, sadly ... I've seen this since the late 90's.
I think we're all probably too close to the tool in question (HPC
storage). Ultimately this is just a hammer for scientists and other
non-CS/IT types, so of course they are going to scoff when we tell them
they are holding the hammer such that it hits sideways. "Who's to tell
me how to hold the hammer?! This side has more metallic surface area
anyhow, making it easier to hit the nail this way!"
So you can either:
a) Fix it transparently with automatic policies/FS's in the back-end.
(I know of at least one FS that packs small files with metadata
transparently on SSDs to expedite small-file IOPS, but message me
off-list for that, as I start work for that shop soon and don't want to
advertise so blatantly.) There are limits to how much these
policies/FS's can fix, though. Bad I/O will still be Bad I/O after a point.
b) Enact any number of the "rules" mentioned previously and tell the
users, "no really, we know a thing or two about these systems, learn how
to hold the hammer." You may need to demonstrate on their skull a few
times for the proper orientation to sink in.
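To make the packing idea in (a) a little more concrete, here is a rough
user-level sketch in Python. It is not how any particular FS does it; the
names pack, read_packed, and packed.bin are made up for illustration, and
the index lives only in memory. The point is just that a thousand tiny
records can sit in one file plus an index instead of a thousand inodes.

```python
import os
import tempfile


def pack(records, container_path):
    """Write each (name, payload) pair back to back into one file.

    Returns a dict mapping name -> (offset, length) within the container.
    """
    index = {}
    with open(container_path, "wb") as out:
        for name, payload in records:
            index[name] = (out.tell(), len(payload))
            out.write(payload)
    return index


def read_packed(container_path, index, name):
    """Fetch one record from the container using its (offset, length) entry."""
    offset, length = index[name]
    with open(container_path, "rb") as f:
        f.seek(offset)
        return f.read(length)


# Hypothetical workload: 1000 tiny records that would otherwise be 1000 files.
records = [("sample_%05d" % i, ("read %d\n" % i).encode()) for i in range(1000)]

with tempfile.TemporaryDirectory() as tmpdir:
    container = os.path.join(tmpdir, "packed.bin")
    index = pack(records, container)
    files_on_disk = len(os.listdir(tmpdir))  # one container, not 1000 files
    one_record = read_packed(container, index, "sample_00042")

print(files_on_disk, one_record)
```

A real back-end would also have to persist the index and handle updates
and deletes, which is where most of the difficulty lives.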
> I did teach a graduate course on HPC programming at my alma mater about
> a decade ago. Covered parallelism, optimization, and gave rough rubrics
> for how to write code that made effective use of the machine resources.
> I had face-palm moments when one of the kids told me he didn't know C,
> but could work in C++. Now-a-days we'd be lucky to find anyone whose
> minds were not polluted by Java + other bad-for-hpc things.
What? How dare you! I pollute my mind with Java every morning, maybe
two or three cups full before any real work gets done ;).
Joking aside, making good use of CPU/Memory resources in the HPC context
even today still requires good knowledge of C/Fortran and all of their
associated parallelism libraries. However, I am not convinced that making
optimal or near-optimal use of remote storage depends at all on whether
you write in C, C++, or Java for that matter. You can do very horrible
things I/O-wise in any of those languages; byte-sized I/O is perhaps even
more likely to happen in a byte-oriented language like C.
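As a quick illustration that the language is not the issue, here is a
minimal Python sketch that reads the same file with unbuffered reads of
one byte and then of 64 KiB, counting the read calls each pattern issues.
The helper names (count_reads, make_test_file) and the 64 KiB size are my
own assumptions for the example; it runs against a local temp file, and on
remote storage each of those calls would cost far more.

```python
import os
import tempfile


def make_test_file(size):
    """Create a temporary file of `size` bytes and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"x" * size)
    return path


def count_reads(path, chunk_size):
    """Read the whole file with unbuffered os.read calls of chunk_size bytes.

    Returns (total_bytes_read, number_of_read_calls), where the count
    includes the final call that returns no data at end of file.
    """
    fd = os.open(path, os.O_RDONLY)
    total = 0
    calls = 0
    try:
        while True:
            data = os.read(fd, chunk_size)
            calls += 1
            if not data:
                break
            total += len(data)
    finally:
        os.close(fd)
    return total, calls


size = 64 * 1024
path = make_test_file(size)
try:
    byte_total, byte_calls = count_reads(path, 1)      # byte-sized I/O
    block_total, block_calls = count_reads(path, size)  # one large request
finally:
    os.remove(path)

print("1-byte reads:", byte_calls, "calls; 64 KiB reads:", block_calls, "calls")
```

Both patterns move the same 64 KiB, but the byte-sized version issues tens
of thousands of calls where the block-sized one needs only a handful.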
Department of Computer Science and Engineering
The Pennsylvania State University