[Beowulf] Torrents for HPC

Wed Jun 13 04:21:58 PDT 2012

On 06/13/12 10:59, Peter wrote:
> On 12/06/12 18:56, Ellis H. Wilson III wrote:
>> On 06/08/12 20:06, Bill Broadley wrote:
>>> A new user on one of my GigE clusters submits batches of 500 jobs that
>>> need to randomly read a 30-60GB dataset.  They aren't the only user of
>>> said cluster so each job will be waiting in the queue with a mix of others.
>> With a 160TB cluster and only a 30-60GB dataset, is there any reason why
>> the user isn't simply storing their dataset in HDFS?  Does the data
>> change frequently via a non-MapReduce framework such that it needs to be
>> pulled from NFS before every job?  If the dataset is in a few dozen
>> files and in HDFS in the cluster, there is no reason why MapReduce
>> shouldn't spawn it's tasks directly "on" the data, without need (most of
>> the time) for moving all of the data to every node as you mention.
>
>    From experience this can have varied results and still requires careful
> management/thought. With HDFS if the replicate number is 3 (often the
> default case) and the 30 node cluster has 500 jobs then either an
 > initial step is required to replicate the data to all other cluster
 > nodes and then perform the analysis (this imposes the expected network /
 > disk IO impact and job start up latency already in place).
 >

It really shouldn't require much management, nor initial data movement 
at all.  BTW, I understood 500 jobs to be totally agnostic about each 
other, as if they were calculating different things using the same 
dataset.  If these are 500 tasks within the same job, well, that's an 
entirely different matter.  If they are just jobs, it really doesn't 
matter if there are 5 or 500, as by default with Hadoop 0.20 at least 
jobs are executed in FIFO order.  Further, if the user programmed his or 
her application to be configurable for number of mappers and reducers, 
it is trivial to match the number of mappers to the slots in the system 
and reducers similarly (though often reducers is something much lower, 
like 1 per node).

Assuming the 30GB dataset is in 30 1GB files, which shouldn't be hard to 
guarantee or achieve, each node will get 1 of these files.  Therefore 
the user simply specifies that he or she wants (let's assume 2 map slots 
per node) 60 map tasks, and Hadoop will silently try to make sure each 
task ends up on one of the three nodes (assuming default triplication) 
that have a local data copy.

> Alternatively keep the replication at 3 (or a.n.other defined number)
> and limit the number of jobs to the available resources where the data
> replicates  pre-exist. The challenge is finding the sweet spot for the
> work in progress and as always nothing is ever free.

With only 30 nodes and 30 to 60GB of data, I think it is safe to assume 
the data exists /everywhere/ in the cluster.  Even if Hadoop was stupid 
and randomly selected a node there would be a 1/10 chance the data was 
already there, and it's not stupid, so it will check all three of the 
nodes with replicas before spawning the task elsewhere.  Now if there 
are 1000 nodes and just 30GB of data, then Hadoop will make sure your 
tasks are prioritized on the nodes that have your data or at least, in 
the same rack as the nodes that have it.

> So HDFS does not remove the replication process although it helps to
> hide the processes involved.

As I've said, if you set things up properly, there shouldn't be much, if 
any, replication, and Hadoop doesn't help to hide the replication -- it 
totally obscures the process.  You have no hand in doing so.

> The other joy encountered with HDFS is that we found it can be less than
> stable in a multi user environment, this has been confirmed by various
> others so as always care is required during testing.

I'll concede that original configuration can be tough, but I've assisted 
with management of an HDFS instance that stored ~60TB of data and over 
10 million files, both as scratch and for users home dirs.  It is 
certainly stable enough for day to day use.

> There are alternatives to HDFS which can be used in conjunction with
> Hadoop but I'm afraid I'm not able to recommend any in particular as
> it's been a while since I last kicked the tyres. Is this something that
> others have more recent experience with and can recommend an alternative ?

I'm working on an alternative to HDFS as we speak, which bypasses HDFS 
entirely and allows people using MapReduce to run directly against 
multiple NAS boxes as if they were a single federated storage system. 
I'll be sending something out to this list about the source when I 
release it.

Best,

ellis