[Beowulf] distributing storage amongst compute nodes

Mark Hahn hahn at mcmaster.ca
Sat Oct 20 14:33:29 PDT 2007


one interesting aspect of configuring clusters is that,
barring bespoke hardware design like blades, it's quite cheap
to add substantial storage onto compute nodes.

sometimes, the cluster's purpose is specialized to support jobs
which do heavy local IO, so this is just fine.  but in general,
clusters have little or no _necessity_ for storage on compute nodes.
(swap is, from the kernel's perspective, always a good thing to have;
arguments for nodes with a full local-disk install are basically aesthetic,
since it's easy to show that diskless nodes generate quite modest 
amounts of "extra" IO.)

we usually choose quite minimal storage on each node - 1 or 2 disks
of the smallest (cheapest) available capacity.  currently most of our clusters have 
2x80G which we specified to enable raid1 for reliability.  (this has
turned out to be unnecessary, since commodity disks are plenty reliable
and are not a significant source of uptime problems.)

but maybe it makes sense not to fight the tide of disturbingly cheap
and dense storage.  even a normal 1U cluster node could often be configured
with several TB of local storage.  the question is: how to make use of it?

making every node part of something like PVFS seems doable.  this could
mean significant interference in tightly-coupled parallel jobs, though.
but it's not clear to me how significant that would be, since in principle,
remote read access is like an RDMA and might involve very little host
overhead.  for one, such a filesystem could be massively striped, so 
even for a large transfer, a single node would only be bothered a small 
amount.  the filesystem could even be smart in choosing, dynamically,
to avoid writing onto nodes that are busy with tightly-coupled jobs.
to be practical, such a filesystem would almost certainly need to be able
to store content redundantly.
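
to make the striping/placement idea a bit more concrete, here is a toy sketch
in python (not how PVFS actually works; the node names, chunk size, replica
count and "busy" flag are all invented for illustration):

CHUNK_SIZE = 1 << 20   # stripe in 1 MiB chunks
REPLICAS   = 2         # keep each chunk on two different nodes

nodes = {
    "node001": {"busy": False},
    "node002": {"busy": True},    # running a tightly-coupled MPI job
    "node003": {"busy": False},
    "node004": {"busy": False},
}

def place_chunks(filesize):
    """return a placement map: chunk index -> list of nodes holding it."""
    # prefer nodes that are not busy with tightly-coupled work
    candidates = [n for n, info in nodes.items() if not info["busy"]]
    if len(candidates) < REPLICAS:
        candidates = list(nodes)   # fall back to all nodes if too few are idle
    nchunks = (filesize + CHUNK_SIZE - 1) // CHUNK_SIZE
    placement = {}
    for i in range(nchunks):
        # round-robin striping; replicas go to the next nodes in the ring
        placement[i] = [candidates[(i + r) % len(candidates)]
                        for r in range(REPLICAS)]
    return placement

if __name__ == "__main__":
    for chunk, where in place_chunks(5 * CHUNK_SIZE).items():
        print(chunk, where)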

further, it would be quite interesting to look at moving the compute to
the data, rather than a compute job "summoning" data.  consider computing
paradigms like Google's map/reduce (which is remarkably general, btw.)
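
as a toy illustration of why moving compute to data is attractive: in a
map/reduce-style job, the map step could run on whichever node already holds
a given block of data, and only the small intermediate (key, count) pairs
cross the network.  everything below (the blocks, the word-count job) is
invented just to show the shape of the paradigm:

from collections import defaultdict

# pretend each block lives on a different node's local disk
blocks = {
    "node001": "the quick brown fox",
    "node002": "jumps over the lazy dog",
    "node003": "the dog barks",
}

def map_phase(text):
    # runs "next to" the data: emit (word, 1) pairs
    return [(word, 1) for word in text.split()]

def reduce_phase(pairs):
    # gathers only the small intermediate results, not the raw data
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

intermediate = []
for node, text in blocks.items():
    intermediate.extend(map_phase(text))   # conceptually executed on that node

print(reduce_phase(intermediate))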

thoughts?

regards, mark hahn.


