[Beowulf] [External] Spark, Julia, OpenMPI etc. - all in one place

Mon Oct 12 12:19:17 PDT 2020

I'm not an expert on Big Data at all, but I hear the term "Hadoop" 
less and less these days. Where I work, most data analysts are using R, 
Python, or Spark in the form of PySpark. For machine learning, most of 
the researchers I support are using Python tools like TensorFlow or 
PyTorch.

I don't know much about Julia replacing MPI, etc., but I wish I did. I 
would like to know more about Julia.

Prentice

On 10/12/20 12:14 PM, Oddo Da wrote:
> Hello,
>
> I used to be in HPC back when we built Beowulf clusters by hand ;) and 
> wrote code in C/pthreads, PVM, and MPI. Back then anyone could walk 
> into fields like bioinformatics; all that was needed was a pulse, some 
> C and Perl, and a desire to do ;-). Then I left for the private sector 
> and stumbled into "big data" some years later - I wrote a lot of code 
> in Spark and Scala, worked on the infrastructure to support it, etc.
>
> Then I went back (in 2017) to HPC. I was surprised to find that not 
> much had changed - researchers and grad students still write code in 
> MPI and C/C++ and maybe some Python or R for visualization or 
> localized data analytics. I also noticed that it was not easy to 
> "marry" things like big data with HPC clusters - tools like 
> Spark/Hadoop do not really have the same underlying infrastructure 
> assumptions as do things like MPI/supercomputers. However, I find it 
> wasteful for a university to run separate clusters to support a data 
> science/big data load vs traditional HPC.
>
> I then stumbled upon languages like Julia - I like its approach, code 
> is data, visualization is easy, decent ML/DS tooling.
>
> How does it fare on a traditional HPC cluster? Are people using it to 
> replace their MPI workloads? On the other side, has it caught up to 
> Spark in the quality of its DS/ML offering? In other words, can it 
> serve as a single, unifying substitute for both opposing approaches?
>
> I realize that many people have already committed to certain 
> tech/paradigms but this is mostly educational debt (if MPI or Spark on 
> the other side is working for me, why go to something different?) - 
> but is there anything substantial stopping new people with no debt 
> starting out in a different approach (offerings like Julia)?
>
> I do not have too much experience with Julia (and hence may be barking 
> up the wrong tree) - in that case I am wondering what people are doing 
> to "marry" the loads of traditional HPC with "big data" as practiced 
> by the commercial/industry entities on a single underlying hardware 
> offering. I know there are things like Twister2 but it is unclear to 
> me (from cursory examination) what it actually offers in the context 
> of my questions above.
>
> Any input, corrections, schooling me etc. are appreciated.
>
> Thank you!

-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov
