[Beowulf] Re: Spark, Julia, OpenMPI etc. - all in one place

Douglas Eadline deadline at eadline.org
Tue Oct 13 12:54:18 PDT 2020


> On Tue, Oct 13, 2020 at 1:31 PM Douglas Eadline <deadline at eadline.org>
> wrote:
>
>>
>> The reality is almost all Analytics projects require multiple
>> tools. For instance, Spark is great, but if you do some
>> data munging of CSV files and want to store your results
>> at scale you can't write a single file to your local file
>> system. Often times you write it as a Hive table to HDFS
>> (e.g. in Parquet format) so it is available for Hive SQL
>> queries or for other tools to use.
>>
>
> You can also commit to a database (but you can't have those running on a
> traditional HPC cluster). What would be nice would be HDFS running on a
> traditional cluster. But that would break the whole parallel filesystem
> exposed as a single mount point thing.... It is funny how these things
> evolved apart from each other to the point they are impossible to marry,
> no?

The two had different goals. HDFS was designed to be a
write-once, read-many "distributed" file system. It was never
intended to be a parallel filesystem or general purpose in any
way; streaming through distributed data from beginning to end
was the goal. Most people are shocked to learn that HDFS does
not allow random reads or writes to files (only appends). It
really should be called the "MapReduce Filesystem."
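To make that concrete, here is a minimal PySpark sketch of the
CSV-to-Hive workflow from the quoted message above (the HDFS
path and table name are made up):

  from pyspark.sql import SparkSession

  # Minimal sketch; the path and table name are hypothetical.
  spark = (SparkSession.builder
           .appName("csv-to-hive")
           .enableHiveSupport()
           .getOrCreate())

  # Munge a CSV that already sits in HDFS.
  df = spark.read.csv("hdfs:///data/raw/events.csv",
                      header=True, inferSchema=True)
  clean = df.dropna()

  # Write it once as a Parquet-backed Hive table so Hive SQL
  # and other tools can get at it.
  clean.write.format("parquet").mode("overwrite") \
       .saveAsTable("analytics.events")

  # There is no updating bytes in place on HDFS: adding data
  # means appending new Parquet files to the table, never
  # rewriting the old ones.
  clean.write.format("parquet").mode("append") \
       .saveAsTable("analytics.events")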

HPC parallel filesystems, on the other hand, are designed to be
general purpose, and Hadoop can use them. At one point there was
a shim for Lustre, and other file systems are supported as well,
but you lose the data locality of HDFS.
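For what it's worth, Spark will also work straight off a shared
POSIX mount, so on an HPC cluster you can point it at the
parallel filesystem with file:// URIs and skip HDFS entirely
(the Lustre mount point below is made up):

  from pyspark.sql import SparkSession

  # Minimal sketch; /lustre/project is a hypothetical
  # parallel-filesystem mount visible on every node. file://
  # URIs bypass HDFS completely, which also means there is no
  # data locality for the scheduler to exploit.
  spark = SparkSession.builder \
      .appName("spark-on-lustre").getOrCreate()

  df = spark.read.parquet("file:///lustre/project/events.parquet")
  counts = df.groupBy("host").count()
  counts.write.parquet("file:///lustre/project/event-counts.parquet")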

Another piece of trivia: an early version of Hadoop used Torque
as the scheduler.

As for HDFS running on a traditional cluster: it can be done,
but I think it is easier to run two clusters. There is no native
support for InfiniBand in Hadoop or Spark (IP over IB works, of
course), so if you have invested in IB, it is not going to get
used to its fullest potential.

It really depends on what you need to do with Hadoop or Spark.
IMO many organizations don't have enough data to justify
standing up a 16-24 node cluster system with a PB of HDFS.

--
Doug