[Beowulf] Torrents for HPC
Peter
pc7 at sanger.ac.uk
Wed Jun 13 07:59:58 PDT 2012
On 12/06/12 18:56, Ellis H. Wilson III wrote:
> On 06/08/12 20:06, Bill Broadley wrote:
>> A new user on one of my GigE clusters submits batches of 500 jobs that
>> need to randomly read a 30-60GB dataset. They aren't the only user of
>> said cluster so each job will be waiting in the queue with a mix of others.
> With a 160TB cluster and only a 30-60GB dataset, is there any reason why
> the user isn't simply storing their dataset in HDFS? Does the data
> change frequently via a non-MapReduce framework such that it needs to be
> pulled from NFS before every job? If the dataset is in a few dozen
> files and in HDFS in the cluster, there is no reason why MapReduce
> shouldn't spawn its tasks directly "on" the data, without need (most of
> the time) for moving all of the data to every node as you mention.
From experience this can have varied results and still requires careful
management and thought. With HDFS, if the replication factor is 3 (often
the default) and the 30-node cluster has 500 jobs, there are two options.
Either an initial step replicates the data to all other cluster nodes
before the analysis runs (this imposes the expected network and disk I/O
impact and the job start-up latency that is already present), or the
replication stays at 3 (or another chosen number) and the number of jobs
is limited to the resources available on the nodes where the replicas
already exist. The challenge is finding the sweet spot for the work in
progress and, as always, nothing is ever free.
So HDFS does not remove the replication process, although it helps to
hide the processes involved.
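As a rough back-of-envelope sketch of the two options above (not a model of how HDFS actually schedules anything), the arithmetic looks something like this. The function name, the 60GB figure taken from the top of the quoted range, and the 8 task slots per node are all assumptions for illustration, and it assumes blocks are spread evenly across the nodes:

```python
def replication_tradeoff(dataset_gb, nodes, replication, slots_per_node):
    """Compare full replication against keeping the existing replicas.

    Hypothetical helper for illustration; assumes even block placement.
    """
    # Option 1: raise replication until every node holds a full copy.
    # Each block already has `replication` copies, so this many more
    # copies of the whole dataset have to cross the network.
    extra_copies = nodes - replication
    extra_traffic_gb = extra_copies * dataset_gb

    # Option 2: leave replication alone. On average each node then
    # holds this share of the dataset on its local disks.
    local_fraction = replication / nodes

    # Only `replication` nodes hold any given block, so at most this
    # many tasks can read that block locally at the same time.
    max_local_tasks_per_block = replication * slots_per_node

    return extra_traffic_gb, local_fraction, max_local_tasks_per_block


# Assumed figures: 60GB dataset, 30 nodes, replication 3, 8 slots per node.
traffic, fraction, local_tasks = replication_tradeoff(60, 30, 3, 8)
print(f"extra data to copy for full replication: {traffic} GB")
print(f"average share of dataset local to a node: {fraction:.0%}")
print(f"max concurrent local tasks per block: {local_tasks}")
```

Under those assumptions, full replication means pushing well over a terabyte across GigE before the jobs start, while staying at 3 means a job that reads the dataset randomly finds only about a tenth of it on local disk.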
The other joy encountered with HDFS is that we found it can be less than
stable in a multi-user environment. This has been confirmed by various
others, so as always care is required during testing.
There are alternatives to HDFS which can be used in conjunction with
Hadoop, but I'm afraid I'm not able to recommend any in particular as
it's been a while since I last kicked the tyres. Does anyone have more
recent experience and an alternative they can recommend?
Pete