[Beowulf] [External] Spark, Julia, OpenMPI etc. - all in one place
Jonathan Aquilina
jaquilina at eagleeyet.net
Tue Oct 13 05:33:47 PDT 2020
Hi Guys,
How does Hadoop manage to crunch large data sets, and what makes it and its siblings the go-to platform for big data?
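(My rough mental model is the classic MapReduce word count: the data sits spread across the cluster's disks in HDFS and the computation is shipped to the data rather than the other way around. Something like this Hadoop Streaming style sketch in Python - just my guess at a minimal illustration, file names and all invented:)

# mapper.py - Hadoop Streaming runs one copy per input split,
# ideally on the node that already holds that block of the file.
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

# reducer.py - the framework sorts mapper output by key, so all
# counts for a given word arrive together and we just sum them.
import sys
current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print(current + "\t" + str(count))

Is that the gist, or is there more to it than data locality plus a sort/shuffle?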
Regards,
Jonathan
-----Original Message-----
From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Michael Di Domenico
Sent: 13 October 2020 14:32
Cc: Beowulf Mailing List <beowulf at beowulf.org>
Subject: Re: [Beowulf] [External] Spark, Julia, OpenMPI etc. - all in one place
i can't speak from a general industry sense, but i've had everything run through my center over the past 11 years. Hadoop seemed like something that was going to take off. it didn't with my group of users. we aren't counting clicks or parsing text from huge files, so its utility to us faded. my understanding is the group behind hadoop also made several industry missteps when trying to commercialize; i'm not sure what happened after that. i think a lot of people realized that hadoop made things easier, but the overhead was too high given the limited functionality most people wanted to use it for.
the julia users i have range from "eh, this is cool" to fanatical (vim vs emacs). my take is that the language is nice, but as they try to commercialize the product i can see some missteps coming as well (aka julia teams)
as for the traditional stuff, it just works. mpi is becoming a bit like cobol: everyone claims it's dead, yet it's still around
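case in point, a working mpi program is a few lines these days. a minimal sketch with mpi4py (assuming you have an mpi library and the mpi4py package installed; run it as `mpirun -n 4 python hello.py`):

# hello.py - minimal mpi4py sketch, not production code
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id, 0..size-1
size = comm.Get_size()   # total number of processes started by mpirun

# every rank contributes a value; rank 0 receives the sum
total = comm.reduce(rank, op=MPI.SUM, root=0)
if rank == 0:
    print("sum of ranks 0..%d = %d" % (size - 1, total))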
right now, AI is big, so python seems to be the language du jour at the moment. python seems to have taken a much stronger hold than julia, even with the non-ai folks i have
On Tue, Oct 13, 2020 at 8:14 AM Oddo Da <oddodaoddo at gmail.com> wrote:
>
> Jim, Peter: by "things have not changed in the tooling" I meant that it is the same approach/paradigm as it was when I was in HPC back in the late 1990s/early 2000s. Even if you look at books about OpenMPI, you can go on their mailing list and ask what books to read, and you will be pointed to the same stuff published 20+ years ago; maybe there are one or two books that are "fresher" than that (I did that a few months ago, naively thinking that things had changed ;-) ).
>
> The approach is still the same - you have to write the code at the low level and worry about everything. It would be nice if this were improved and things were abstracted up and away a bit. The appearance of Spark, for example, did exactly that for data science/machine learning/"big data" - esp. when you write it in Scala (functional programming) - it just makes for all sorts of cleaner, abstracted, more correct code, where the framework worries about the underlying data/computation locality, the communication between all the machinery, etc., and you are left to worry about the problem you are solving. I just feel that in the HPC world we have not moved to this point yet, and I am trying to understand why.
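> To make that concrete, here is roughly what I mean - a PySpark sketch (the file and column names are invented) where the framework decides how the data is partitioned, where the work runs, and how results are shuffled, while the code only states the question:
>
> # a rough illustrative sketch, not a real workload
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.appName("example").getOrCreate()
>
> events = spark.read.parquet("hdfs:///data/events.parquet")  # invented path
> summary = (events.groupBy("user_id")
>                  .agg(F.count("*").alias("n_events"),
>                       F.avg("latency_ms").alias("avg_latency_ms")))
> summary.write.parquet("hdfs:///data/summary.parquet")
>
> Not a line of that says which machine holds which rows.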
>
> I mean, let's say I was a data science researcher at a university and all that was on offer was the traditional HPC cluster - what tooling would I use to do my research? The whole world is doing something else but I am stuck worrying about the low-level details.... or I need to ask for a separate HDFS/Spark cluster? What if I want to stream data from somewhere, as is done commonly in the industry (solutions like Kafka etc.)? My only option is to stand up a local cluster (costs time, money, ongoing admin/maintenance) or to go to AWS or Azure and spend taxpayer money to fill corporate coffers for what should already be a solved problem, given the money that was already spent on hardware at the university.
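> (By streaming I mean something as mundane as the following kafka-python consumer sketch - topic and broker names invented - which presupposes a running Kafka cluster, i.e. exactly the infrastructure a traditional HPC site typically does not offer:)
>
> # minimal kafka-python consumer sketch; assumes a Kafka cluster exists
> from kafka import KafkaConsumer
>
> consumer = KafkaConsumer(
>     "sensor-readings",                        # invented topic name
>     bootstrap_servers="kafka01.example:9092", # invented broker
>     group_id="analysis",
>     auto_offset_reset="earliest",
> )
> for msg in consumer:
>     print(msg.key, len(msg.value))            # analysis would go here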
>
> BTW, Spark is just an example of how tooling/methodologies have improved in the industry in the domain of distributed computation. This is why I thought that Julia may be one of those things that provides a different (improved?) way of doing things where both the climate modeling guys and the data science guys can utilize the same HPC hardware....
>
> On Tue, Oct 13, 2020 at 4:49 AM Jim Cownie <jcownie at gmail.com> wrote:
>>
>> It just seems to me that things have not really changed in the tooling in the HPC space since 20+ years ago.
>>
>>
>> It's also worth pointing out that the OpenMP of the year 2000 (OpenMP 2.0) is not the OpenMP of 2020 (OpenMP 5.1), just as C++20 is not C++98, and, similarly, MPI has also advanced in the last twenty years (as has Fortran).
>>
>> Just because the name is the same does not mean that the specification and its capabilities are the same.
>>
>> Taking OpenMP, off the top of my head, major changes include all of
>> the support for:
>>
>> * Offloading computation to target devices (normally GPUs at present)
>> * Tasking (including task dependencies)
>> * Vectorisation directives
>>
>> (there are undoubtedly many other changes; heck, in 2000 the standard
>> was 124pp for Fortran + 85pp for C/C++, whereas the TR for 5.1 is 715pp,
>> so there’s a lot more in there!)
>>
>> “Rip it up and start again” https://www.youtube.com/watch?v=UzPh89tD5pA is not always the best approach, and those of us who were around in the 80s and 90s did know a few things even back then!
>>
>> -- Jim
>> James Cownie <jcownie at gmail.com>
>> Mob: +44 780 637 7146
>>
>> On 13 Oct 2020, at 09:21, Peter Kjellström <cap at nsc.liu.se> wrote:
>>
>> On Mon, 12 Oct 2020 22:04:30 -0400
>> Oddo Da <oddodaoddo at gmail.com> wrote:
>>
>> Johann-Tobias,
>>
>> Thank you for the reply.
>>
>> I don't know enough detail about Julia to even be confused (I am
>> learning it now) :-)
>>
>> It just seems to me that things have not really changed in the
>> tooling in the HPC space since 20+ years ago. This is why I thought,
>> well, hoped that something new and more interesting would have come
>> along - like Julia. Being able to express better and at a higher
>> level parallelization or distribution tasks (higher than MPI anyway)
>> would be nice. Spark is nice that way in the data science space but
>> it cannot run in the same space/hardware as traditional HPC
>> approaches, sadly.
>>
>>
>> Well quite a few things have "come along" but there's so much
>> inertia in C/C++/Fortran with OpenMP (and/)or MPI that new things are
>> pretty much invisible if you look at what's run on an everyday basis...
>>
>> From this perspective (and my vantage point in national academic HPC)
>> it seems the only significant change over the last 10-20 years is
>> Python use (as more or less complete applications, glue, interactive
>> work, ...). There is also the scale of parallelism (it used to be
>> 10-100 on a system; now it is 10000-100000).
>>
>> For things that have "come along" but not quite altered the big
>> picture (much?) (yet?) (ever?), here are a few:
>>
>> * Chapel (https://chapel-lang.org/)
>> * OpenMP and OpenACC for GPU use
>> * HPX (https://stellar-group.org/)
>> * Legion (https://legion.stanford.edu/)
>> * Mooaaaar_taaaasks(MPI+OpenMP) -> TAMPI+OmpSS2 (https://pm.bsc.es/)
>>
>> /Peter
_______________________________________________
Beowulf mailing list, Beowulf at beowulf.org, sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf