[Beowulf] ***UNCHECKED*** Re: Spark, Julia, OpenMPI etc. - all in one place

Douglas Eadline deadline at eadline.org
Tue Oct 13 07:28:41 PDT 2020


> Hi Doug,
>
> How have they managed to squeeze so much performance out of java for such
> big data sets?

Nothing to do with Java; it originally had to do with "moving
computation to the data." Hadoop YARN can provide data locality
for Map Reduce: large files are sliced across HDFS data nodes,
and the Map step can operate on each slice independently, so you
run the Map in parallel on the nodes that already hold the data
slices. (It is a bit more complicated than that, but that is the
general idea.)
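
As a rough sketch of that idea (file paths are made up, and I am
using Spark over HDFS rather than classic Hadoop Map Reduce): each
partition of the input corresponds roughly to one HDFS block, and
the scheduler tries to place the map tasks on the nodes that hold
those blocks.

    import org.apache.spark.sql.SparkSession

    object LocalityWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("LocalityWordCount").getOrCreate()
        val sc = spark.sparkContext

        // One partition per HDFS block; map work is scheduled next to the data
        val lines = sc.textFile("hdfs:///data/big_corpus.txt")

        val counts = lines
          .flatMap(_.split("\\s+"))    // Map: runs independently on every slice
          .map(word => (word, 1))
          .reduceByKey(_ + _)          // Reduce: the shuffle combines partial counts

        counts.saveAsTextFile("hdfs:///out/word_counts")
        spark.stop()
      }
    }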

Now the trick is to keep intermediate results in memory,
because most high-level analytics jobs involve multiple
Map Reduce steps. This is why Spark is billed as
"faster than Hadoop": everything is done against a
distributed in-memory object spread across the nodes. Hadoop
Map Reduce now does this with the "Tez" acceleration
component.
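
To make that concrete, here is a minimal spark-shell style sketch
(the log format and paths are hypothetical): the cleaned data set
is reused by two later jobs, and cache() keeps the intermediate
result in memory across the nodes instead of re-reading HDFS for
each pass.

    // Parse and clean once (the first "Map Reduce step"), then keep
    // the intermediate result in distributed memory with cache()
    val events = sc.textFile("hdfs:///data/events.log")
      .map(_.split(","))
      .filter(_.length == 3)
      .cache()

    // Two further passes reuse the cached data instead of rereading HDFS
    val byUser   = events.map(f => (f(0), 1)).reduceByKey(_ + _)
    val byAction = events.map(f => (f(1), 1)).reduceByKey(_ + _)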

Here is another important point: Data Engineering (data
cleaning, verification, building the feature matrix) is where
scale comes into play. Running models (unless you are
training an ML model) usually does not require a huge amount
of computing power.
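
For instance (the table and column names below are made up for
illustration), the cluster-sized work is turning raw records into
a per-user feature matrix; scoring an already-trained model over
that matrix afterwards is comparatively cheap.

    import org.apache.spark.sql.functions._

    val clicks = spark.read.parquet("hdfs:///warehouse/clicks")

    val features = clicks
      .filter(col("user_id").isNotNull)              // cleaning / verification
      .groupBy("user_id")                            // one row per user
      .agg(count(lit(1)).as("n_events"),
           countDistinct("page").as("n_pages"),
           max("ts").as("last_seen"))                // the feature matrix

    features.write.parquet("hdfs:///warehouse/user_features")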

--
Doug


>
> Regards,
> Jonathan
>
> -----Original Message-----
> From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Douglas Eadline
> Sent: 13 October 2020 15:55
> To: Oddo Da <oddodaoddo at gmail.com>
> Cc: beowulf at beowulf.org
> Subject: [Beowulf] ***UNCHECKED*** Re: Spark, Julia, OpenMPI etc. - all in
> one place
>
>
> I have noticed a lot of Hadoop/Spark references in the replies.
> The word "Hadoop" is probably the most misunderstood word in computing
> today and may people have a somewhat vague idea what it actually is.
>
> Hadoop V1 was a monolithic Map Reduce framework written in Java. (BTW Map
> Reduce is a SIMD algorithm)
>
> In Hadoop V2 the Map Reduce component was separated from the scheduler
> (YARN) and the underlying distributed file system (HDFS). It is best
> thought of as a "platform" for developing big data systems. The most
> popular Map Reduce application is Hive.
> Developed by Facebook, it allows relational (SQL) queries to be run at
> scale.
>
> Hadoop V3 and beyond is moving more toward a true cloud-based environment
> with a new file system called Ozone. Note that the need for HDFS made
> cloud migration difficult.
>
> Spark is a completely separate code base that has its own Map Reduce
> engine. It can work stand-alone, with the YARN scheduler, or with other
> schedulers. It can also take advantage of HDFS.
>
> Spark is a framework, Hadoop is a platform. Map Reduce is a SIMD
> algorithm that works well with large amounts of read-only data.
>
> There is more to it, but that is the gist of it.
>
> --
> Doug
>
>> Hello,
>>
>> I used to be in HPC back when we built beowulf clusters by hand ;) and
>> wrote code in C/pthreads, PVM and MPI, back when anyone could walk
>> into fields like bioinformatics - all that was needed was a pulse, some
>> C and Perl, and a desire to do it ;-). Then I left for the private sector
>> and stumbled into "big data" some years later - I wrote a lot of code
>> in Spark and Scala, worked in infrastructure to support it etc.
>>
>> Then I went back (in 2017) to HPC. I was surprised to find that not
>> much has changed - researchers and grad students still write code in
>> MPI and C/C++ and maybe some Python or R for visualization or
>> localized data analytics. I also noticed that it was not easy to
>> "marry" things like big data with HPC clusters - tools like
>> Spark/Hadoop do not really have the same underlying infrastructure
>> assumptions as do things like MPI/supercomputers. However, I find it
>> wasteful for a university to run separate clusters to support a data
>> science/big data load vs traditional HPC.
>>
>> I then stumbled upon languages like Julia - I like its approach, code
>> is data, visualization is easy, decent ML/DS tooling.
>>
>> How does it fare on a traditional HPC cluster? Are people using it to
>> replace their MPI workloads? On the opposite side, has it caught up to
>> Spark in terms of DS/ML quality of offering? In other words, can it be
>> used, in one fell swoop, as a unifying substitute for both opposing
>> approaches?
>>
>> I realize that many people have already committed to certain
>> tech/paradigms but this is mostly educational debt (if MPI or Spark on
>> the other side is working for me, why go to something different?) -
>> but is there anything substantial stopping new people with no debt
>> from starting out with a different approach (offerings like Julia)?
>>
>> I do not have much experience with Julia (and hence may be barking
>> up the wrong tree) - in that case I am wondering what people are doing
>> to "marry" the loads of traditional HPC with "big data" as practiced
>> by the commercial/industry entities on a single underlying hardware
>> offering. I know there are things like Twister2 but it is unclear to
>> me (from cursory
>> examination) what it actually offers in the context of my questions
>> above.
>>
>> Any input, corrections, schooling me etc. are appreciated.
>>
>> Thank you!
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>> Computing To change your subscription (digest mode or unsubscribe)
>> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>
>
> --
> Doug
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>


-- 
Doug


