[Beowulf] Torrents for HPC

Peter pc7 at sanger.ac.uk
Wed Jun 13 08:43:53 PDT 2012


On 13/06/12 12:21, Ellis H. Wilson III wrote:
> On 06/13/12 10:59, Peter wrote:
>> On 12/06/12 18:56, Ellis H. Wilson III wrote:
>>> On 06/08/12 20:06, Bill Broadley wrote:
>>>> A new user on one of my GigE clusters submits batches of 500 jobs that
>>>> need to randomly read a 30-60GB dataset.  They aren't the only user of
>>>> said cluster so each job will be waiting in the queue with a mix of others.
>>> With a 160TB cluster and only a 30-60GB dataset, is there any reason why
>>> the user isn't simply storing their dataset in HDFS?  Does the data
>>> change frequently via a non-MapReduce framework such that it needs to be
>>> pulled from NFS before every job?  If the dataset is in a few dozen
>>> files and in HDFS in the cluster, there is no reason why MapReduce
>>> shouldn't spawn it's tasks directly "on" the data, without need (most of
>>> the time) for moving all of the data to every node as you mention.
>>      From experience this can have varied results and still requires careful
>> management/thought. With HDFS if the replicate number is 3 (often the
>> default case) and the 30 node cluster has 500 jobs then either an
>   >  initial step is required to replicate the data to all other cluster
>   >  nodes and then perform the analysis (this imposes the expected network /
>   >  disk IO impact and job start up latency already in place).
>   >
>
> It really shouldn't require much management, nor initial data movement
> at all.  BTW, I understood 500 jobs to be totally agnostic about each
> other, as if they were calculating different things using the same
> dataset.  If these are 500 tasks within the same job, well, that's an
> entirely different matter.  If they are just jobs, it really doesn't
> matter if there are 5 or 500, as by default with Hadoop 0.20 at least
> jobs are executed in FIFO order.  Further, if the user programmed his or
> her application to be configurable for number of mappers and reducers,
> it is trivial to match the number of mappers to the slots in the system
> and reducers similarly (though often reducers is something much lower,
> like 1 per node).
>
> Assuming the 30GB dataset is in 30 1GB files, which shouldn't be hard to
> guarantee or achieve, each node will get 1 of these files.  Therefore
> the user simply specifies that he or she wants (let's assume 2 map slots
> per node) 60 map tasks, and Hadoop will silently try to make sure each
> task ends up on one of the three nodes (assuming default triplication)
> that have a local data copy.
>
>> Alternatively keep the replication at 3 (or a.n.other defined number)
>> and limit the number of jobs to the available resources where the data
>> replicates  pre-exist. The challenge is finding the sweet spot for the
>> work in progress and as always nothing is ever free.
> With only 30 nodes and 30 to 60GB of data, I think it is safe to assume
> the data exists /everywhere/ in the cluster.  Even if Hadoop was stupid
> and randomly selected a node there would be a 1/10 chance the data was
> already there, and it's not stupid, so it will check all three of the
> nodes with replicas before spawning the task elsewhere.  Now if there
> are 1000 nodes and just 30GB of data, then Hadoop will make sure your
> tasks are prioritized on the nodes that have your data or at least, in
> the same rack as the nodes that have it.
>
>> So HDFS does not remove the replication process although it helps to
>> hide the processes involved.
> As I've said, if you set things up properly, there shouldn't be much, if
> any, replication, and Hadoop doesn't help to hide the replication -- it
> totally obscures the process.  You have no hand in doing so.
>
>> The other joy encountered with HDFS is that we found it can be less than
>> stable in a multi user environment, this has been confirmed by various
>> others so as always care is required during testing.
> I'll concede that original configuration can be tough, but I've assisted
> with management of an HDFS instance that stored ~60TB of data and over
> 10 million files, both as scratch and for users home dirs.  It is
> certainly stable enough for day to day use.
>
>> There are alternatives to HDFS which can be used in conjunction with
>> Hadoop but I'm afraid I'm not able to recommend any in particular as
>> it's been a while since I last kicked the tyres. Is this something that
>> others have more recent experience with and can recommend an alternative ?
> I'm working on an alternative to HDFS as we speak, which bypasses HDFS
> entirely and allows people using MapReduce to run directly against
> multiple NAS boxes as if they were a single federated storage system.
> I'll be sending something out to this list about the source when I
> release it.
>
> Best,
>
Many thanks for your comments Ellis,

I read the initial Q that the full data set may be required by any job 
so an upgrade to my personal filters may be required :). If this were 
the case then post job submission it becomes a wait until a node with 
the data becomes available or alternatively a copy to a.n.other node 
needs to take place before it can be used for the task at hand. At this 
point it's sort of a balance between how many nodes are available 
immediately for the task and how long do you wish to wait, either for 
the FIFO tasks to complete on a subset of available nodes or the copy to 
take place.

Given that 30-60Gb is small enough copy everywhere, that sort of takes 
things full circle to the initial rsync options (and variants) 
previously discussed to local disk. Although I apologise if I'm 
miss-interpreting the above.

The comment regarding the obscuring the replication process was directed 
more towards the user experience, they don't need to know it 
automagically happens BUT behind the scenes the copies are happening all 
the same, with the expected impact incurred on IO etc. So HDFS doesn't 
make the process impact free.

If you are able to send more to the list regarding HDFS plan B that 
would be great and certainly something I'd be interested in hearing more 
about. Do you have a blog or similar with references regarding any of 
the above ? If so that would be much appreciated.

Thanks again and good luck with the multiple NAS option.
Pete


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


More information about the Beowulf mailing list