<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>I'm not an expert on Big Data at all, but I hear the phrase
"Hadoop" less and less these days. Where I work, most data
analysts are using R, Python, or Spark in the form of PySpark. For
machine learning, most of the researchers I support are using
Python tools like TensorFlow or PyTorch. <br>
</p>
<p>I don't know much about Julia replacing MPI, etc., but I wish I
did. I would like to know more about Julia. <br>
</p>
<p>Prentice<br>
</p>
<div class="moz-cite-prefix">On 10/12/20 12:14 PM, Oddo Da wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CALFK+OaCvNq7txN6OnkAQz5Xv3VrDSDrEpkU2UgmOOkqMH6jCg@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>Hello,</div>
<div><br>
</div>
<div>I used to be in HPC back when we built Beowulf clusters by
hand ;) and wrote code in C/pthreads, PVM, and MPI, back
when anyone could walk into fields like bioinformatics; all
that was needed was a pulse, some C and Perl, and a desire to
do ;-). Then I left for the private sector and stumbled into
"big data" some years later - I wrote a lot of code in Spark
and Scala, worked in infrastructure to support it, etc.</div>
<div><br>
</div>
<div>Then I went back (in 2017) to HPC. I was surprised to find
that not much had changed - researchers and grad students
still write code in MPI and C/C++, with maybe some Python or R
for visualization or localized data analytics. I also noticed
that it was not easy to "marry" things like big data with HPC
clusters - tools like Spark/Hadoop do not really share the same
underlying infrastructure assumptions as things like
MPI/supercomputers. However, I find it wasteful for a
university to run separate clusters to support a data
science/big data load vs. traditional HPC.<br>
</div>
<div><br>
</div>
<div>I then stumbled upon languages like Julia - I like its
approach: code is data, visualization is easy, and it has
decent ML/DS tooling. <br>
</div>
<div><br>
</div>
<div>How does it fare on a traditional HPC cluster? Are people
using it to replace their MPI workloads? On the opposite side,
has it caught up to Spark in terms of the quality of its DS/ML
offering? In other words, can it serve, in one fell swoop, as
a unifying substitute for both opposing approaches? <br>
</div>
</div>
<div><br>
</div>
<div>I realize that many people have already committed to
certain tech/paradigms, but that is mostly educational debt (if
MPI, or Spark on the other side, is working for me, why move to
something different?) - so is there anything substantial
stopping new people with no such debt from starting out with a
different approach (offerings like Julia)?</div>
<div><br>
</div>
<div>I do not have much experience with Julia (and hence may
be barking up the wrong tree) - in that case I am wondering
what people are doing to "marry" the workloads of traditional
HPC with "big data" as practiced by commercial/industry
entities on a single underlying hardware offering. I know
there are things like Twister2, but it is unclear to me (from a
cursory examination) what it actually offers in the context of
my questions above.<br>
</div>
<div><br>
</div>
<div> Any input, corrections, schooling, etc. are appreciated.</div>
<div><br>
</div>
<div>Thank you!<br>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
Beowulf mailing list, <a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit <a class="moz-txt-link-freetext" href="https://beowulf.org/cgi-bin/mailman/listinfo/beowulf">https://beowulf.org/cgi-bin/mailman/listinfo/beowulf</a>
</pre>
</blockquote>
<pre class="moz-signature" cols="72">--
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
<a class="moz-txt-link-freetext" href="http://www.pppl.gov">http://www.pppl.gov</a></pre>
</body>
</html>