[Beowulf] [External] Spark, Julia, OpenMPI etc. - all in one place

Jo-To Schäg johtobsch at gmail.com
Mon Oct 12 16:11:27 PDT 2020


I have some experience with Julia and can say with certainty that
Julia is not aiming to replace MPI.
Julia is a programming language aiming for a place in HPC and other
development-time-limited, computation-heavy areas. Some Julia programs
also use MPI for internode communication. For example, CliMA [1]
defines its own array type, MPIStateArray, which presents itself as a
shared array over all machines in the cluster. Under the hood it
handles synchronization of ghost elements, so local stencils can use
up-to-date data from their neighbouring machines, all while hiding
behind the interface of an array data structure.
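As a sketch of the underlying idea (this is not CliMA's actual
implementation; the 1-D layout and variable names are made up for
illustration, and I am assuming MPI.jl's keyword form of Sendrecv!),
a periodic ghost-element exchange might look like:

```julia
# Hypothetical 1-D halo exchange with MPI.jl -- just the idea behind
# MPIStateArray, not its real code.
using MPI

MPI.Init()
comm   = MPI.COMM_WORLD
rank   = MPI.Comm_rank(comm)
nprocs = MPI.Comm_size(comm)

# Local chunk with one ghost cell on each side.
n = 8
u = zeros(Float64, n + 2)
u[2:end-1] .= rank            # interior initialized to this rank's id

left  = mod(rank - 1, nprocs) # periodic neighbours
right = mod(rank + 1, nprocs)

# Exchange boundary values so local stencils see up-to-date data:
# send my first interior cell to the left neighbour while receiving
# my right ghost cell from the right neighbour, and vice versa.
MPI.Sendrecv!(view(u, 2:2), view(u, n+2:n+2), comm;
              dest=left, source=right)
MPI.Sendrecv!(view(u, n+1:n+1), view(u, 1:1), comm;
              dest=right, source=left)

MPI.Finalize()
```

Run with something like `mpiexec -n 4 julia halo.jl`; after the
exchange each rank's ghost cells hold its neighbours' boundary values.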
However, Julia also has other means of internode communication, such
as a channel primitive that can send arbitrary data structures. So I
see where the confusion might come from.
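For instance, a minimal sketch of that channel primitive, using
RemoteChannel from the Distributed standard library (the workers here
are local processes, but addprocs can equally target remote hosts):

```julia
# Julia's built-in internode communication: a RemoteChannel can carry
# arbitrary Julia values between processes, independent of MPI.
using Distributed
addprocs(2)                      # two worker processes

ch = RemoteChannel(() -> Channel{Any}(10))

# Each worker computes something and sends a (worker id, result) tuple.
for w in workers()
    @spawnat w put!(ch, (myid(), sum(1:100)))
end

results = [take!(ch) for _ in 1:nworkers()]
println(results)                 # e.g. [(2, 5050), (3, 5050)]
```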
Here are instructions on how to use Julia on MIT's Satori cluster [2].
From that, and the surrounding excitement around Julia, I think Julia
is slowly growing more popular in the supercomputing space, although
the number of projects running on large-scale clusters can probably
still be counted on one hand.
I am aware of the following projects using Julia at Cluster Scale:
 - Celeste [3]
 - CliMA
 - the New York Fed's DSGE model [4]
 - model-informed drug development at Pfizer [5]

If someone is interested in learning Julia, a good place to come into
contact with the community is the Julia Slack.

Sincerely,
Johann-Tobias Schäg

[1] https://github.com/CliMA/ClimateMachine.jl
[2] https://mit-satori.github.io/satori-julia.html#getting-started
[3] https://www.youtube.com/watch?v=uecdcADM3hY
[4] https://github.com/FRBNY-DSGE/DSGE.jl
[5] https://juliacomputing.com/case-studies/pfizer.html


On Mon, 12 Oct 2020 at 21:20, Prentice Bisbal via Beowulf
<beowulf at beowulf.org> wrote:
>
> I'm not an expert on Big Data at all, but I hear the phrase "Hadoop" less and less these days. Where I work, most data analysts are using R, Python, or Spark in the form of PySpark. For machine learning, most of the researchers I support are using Python tools like TensorFlow or PyTorch.
>
> I don't know much about Julia replacing MPI, etc., but I wish I did. I would like to know more about Julia.
>
> Prentice
>
> On 10/12/20 12:14 PM, Oddo Da wrote:
>
> Hello,
>
> I used to be in HPC back when we built beowulf clusters by hand ;) and wrote code in C/pthreads, PVM and MPI and back when anyone could walk into fields like bioinformatics, all that was needed was a pulse, some C and Perl and a desire to do ;-). Then I left for the private sector and stumbled into "big data" some years later - I wrote a lot of code in Spark and Scala, worked in infrastructure to support it etc.
>
> Then I went back (in 2017) to HPC. I was surprised to find that not much had changed - researchers and grad students still write code in MPI and C/C++ and maybe some Python or R for visualization or localized data analytics. I also noticed that it was not easy to "marry" things like big data with HPC clusters - tools like Spark/Hadoop do not really share the same underlying infrastructure assumptions as things like MPI/supercomputers. However, I find it wasteful for a university to run separate clusters to support a data science/big data load vs. a traditional HPC load.
>
> I then stumbled upon languages like Julia - I like its approach, code is data, visualization is easy, decent ML/DS tooling.
>
> How does it fare on a traditional HPC cluster? Are people using it to substitute their MPI loads? On the opposite side, has it caught up to Spark in terms of the quality of its DS/ML offering? In other words, can it be used in one fell swoop as a unifying substitute for both opposing approaches?
>
> I realize that many people have already committed to certain tech/paradigms but this is mostly educational debt (if MPI or Spark on the other side is working for me, why go to something different?) - but is there anything substantial stopping new people with no debt starting out in a different approach (offerings like Julia)?
>
> I do not have too much experience with Julia (and hence may be barking up the wrong tree) - in that case I am wondering what people are doing to "marry" the loads of traditional HPC with "big data" as practiced by the commercial/industry entities on a single underlying hardware offering. I know there are things like Twister2 but it is unclear to me (from cursory examination) what it actually offers in the context of my questions above.
>
> Any input, corrections, schooling me etc. are appreciated.
>
> Thank you!
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
> --
> Prentice Bisbal
> Lead Software Engineer
> Research Computing
> Princeton Plasma Physics Laboratory
> http://www.pppl.gov
>

