[Beowulf] ***UNCHECKED*** Re: Spark, Julia, OpenMPI etc. - all in one place

Jonathan Aquilina jaquilina at eagleeyet.net
Tue Oct 13 07:12:07 PDT 2020


Hi Doug,

How have they managed to squeeze so much performance out of Java for such big data sets?

Regards,
Jonathan

-----Original Message-----
From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Douglas Eadline
Sent: 13 October 2020 15:55
To: Oddo Da <oddodaoddo at gmail.com>
Cc: beowulf at beowulf.org
Subject: [Beowulf] ***UNCHECKED*** Re: Spark, Julia, OpenMPI etc. - all in one place


I have noticed a lot of Hadoop/Spark references in the replies.
The word "Hadoop" is probably the most misunderstood word in computing today, and many people have only a somewhat vague idea of what it actually is.

Hadoop V1 was a monolithic Map Reduce framework written in Java. (BTW Map Reduce is a SIMD algorithm)

In Hadoop V2, the Map Reduce component was separated from the scheduler (YARN) and the underlying distributed file system (HDFS). It is best thought of as a "platform" for developing big data systems. The most popular Map Reduce application is Hive.
Developed by Facebook, it allows relational databases to be run at scale.

Hadoop V3 and beyond is moving more toward a true cloud-based environment with a new file system called Ozone. Note that the need for HDFS made cloud migration difficult.

Spark is a completely separate code base that has its own Map Reduce engine. It can work stand-alone, with the YARN scheduler, or with other schedulers. It can also take advantage of HDFS.

Spark is a language, Hadoop is a platform. Map Reduce is a SIMD algorithm that works well with large amounts of read-only data.
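
As a rough illustration of the Map Reduce pattern mentioned above, here is a minimal single-process sketch in plain Python. It is not Hadoop or Spark code; the names map_partition, merge_counts, and word_count, and the sample data, are made up for the example. The same operation is applied independently to each read-only partition (the map step), and the partial results are then combined (the reduce step).

```python
from collections import Counter
from functools import reduce


def map_partition(lines):
    """Map step: count the words in one read-only partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial word counts."""
    return left + right


def word_count(partitions):
    # The same function runs on every partition; the partitions are never
    # modified, so on a real cluster each one could be processed on a
    # different node. Here they simply run one after another.
    partials = [map_partition(p) for p in partitions]
    return reduce(merge_counts, partials, Counter())


# Hypothetical input: two partitions of text lines.
partitions = [
    ["hadoop is a platform", "spark is separate"],
    ["map reduce works on read-only data"],
]
totals = word_count(partitions)
print(totals["is"])
```

In a real deployment the partitions would typically be blocks of a distributed file system, and the scheduler would decide where each map task runs; this sketch only shows the shape of the computation.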

There is more to it, but that is the gist of it.

--
Doug

> Hello,
>
> I used to be in HPC back when we built beowulf clusters by hand ;) and 
> wrote code in C/pthreads, PVM and MPI and back when anyone could walk 
> into fields like bioinformatics, all that was needed was a pulse, some 
> C and Perl and a desire to do ;-). Then I left for the private sector 
> and stumbled into "big data" some years later - I wrote a lot of code 
> in Spark and Scala, worked in infrastructure to support it etc.
>
> Then I went back (in 2017) to HPC. I was surprised to find that not 
> much has changed - researchers and grad students still write code in 
> MPI and C/C++ and maybe some Python or R for visualization or 
> localized data analytics. I also noticed that it was not easy to 
> "marry" things like big data with HPC clusters - tools like 
> Spark/Hadoop do not really have the same underlying infrastructure 
> assumptions as do things like MPI/supercomputers. However, I find it 
> wasteful for a university to run separate clusters to support a data 
> science/big data load vs traditional HPC.
>
> I then stumbled upon languages like Julia - I like its approach, code 
> is data, visualization is easy, decent ML/DS tooling.
>
> How does it fare on a traditional HPC cluster? Are people using it to 
> substitute their MPI loads? On the opposite side, has it caught up to 
> Spark in terms of DS/ML quality of offering? In other words, can it be 
> used as a one fell swoop unifying substitute for both opposing 
> approaches?
>
> I realize that many people have already committed to certain 
> tech/paradigms but this is mostly educational debt (if MPI or Spark on 
> the other side is working for me, why go to something different?) - 
> but is there anything substantial stopping new people with no debt 
> starting out in a different approach (offerings like Julia)?
>
> I do not have too much experience with Julia (and hence may be barking 
> up the wrong tree) - in that case I am wondering what people are doing 
> to "marry" the loads of traditional HPC with "big data" as practiced 
> by the commercial/industry entities on a single underlying hardware 
> offering. I know there are things like Twister2 but it is unclear to 
> me (from cursory examination) what it actually offers in the context 
> of my questions above.
>
> Any input, corrections, schooling me etc. are appreciated.
>
> Thank you!
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
> Computing To change your subscription (digest mode or unsubscribe) 
> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>


