[Beowulf] Spark, Julia, OpenMPI etc. - all in one place

Oddo Da oddodaoddo at gmail.com
Mon Oct 12 09:14:25 PDT 2020


Hello,

I used to be in HPC back when we built Beowulf clusters by hand ;) and
wrote code in C/pthreads, PVM and MPI, and back when anyone could walk into
fields like bioinformatics: all that was needed was a pulse, some C and
Perl, and a desire to do ;-). Then I left for the private sector and
stumbled into "big data" some years later - I wrote a lot of code in Spark
and Scala, worked in infrastructure to support it etc.

Then I went back (in 2017) to HPC. I was surprised to find that not much
had changed - researchers and grad students still write code in MPI and
C/C++ and maybe some Python or R for visualization or localized data
analytics. I also noticed that it was not easy to "marry" things like big
data with HPC clusters - tools like Spark/Hadoop do not really have the
same underlying infrastructure assumptions as do things like
MPI/supercomputers. However, I find it wasteful for a university to run
separate clusters to support a data science/big data load vs traditional
HPC.

I then stumbled upon languages like Julia - I like its approach, code is
data, visualization is easy, decent ML/DS tooling.

How does it fare on a traditional HPC cluster? Are people using it to
replace their MPI workloads? On the other side, has it caught up to Spark
in terms of the quality of its DS/ML offering? In other words, can it serve
as a single unifying substitute for both approaches?

I realize that many people have already committed to certain tech/paradigms
but this is mostly educational debt (if MPI or Spark on the other side is
working for me, why go to something different?) - but is there anything
substantial stopping new people with no debt starting out in a different
approach (offerings like Julia)?

I do not have too much experience with Julia (and hence may be barking up
the wrong tree) - in that case I am wondering what people are doing to
"marry" the loads of traditional HPC with "big data" as practiced by the
commercial/industry entities on a single underlying hardware offering. I
know there are things like Twister2 but it is unclear to me (from cursory
examination) what it actually offers in the context of my questions above.

Any input, corrections, schooling me etc. are appreciated.

Thank you!

