[Beowulf] Beowulf Cluster VS Hadoop/Spark
deadline at eadline.org
Fri Dec 30 12:41:29 PST 2016
> I suspect that you can take any hadoop/spark application and give it to a
> good C/C++/OpenMp/MPI coder and in six months, a year, two years,..., you
> will end up with a much faster and much more efficient application.
> Meanwhile the original question the application was answering very likely
> won't matter to those who originally used hadoop/spark to answer it.
There are some problems that may merit programmer costs of that
magnitude, then there are others that can tolerate a little
inefficiency because the application works and scales now.
> It's worth keeping in mind that a lot (maybe most) of the "big data"
> analysis being done is done in Microsoft Excel.
Marketing aside, they call it big data for a reason.
Performing data analytics on a laptop is perfectly feasible,
and a good way to get a feel for your data. If you need to
turn your app loose on Tbytes of real data generated each day
then you may want to use something a bit heftier.
> Python and R cover a big
> chunk of it, often running on laptops.
PySpark runs on a laptop, and if your data grows to where you need
to scale, PySpark scales.
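
As a rough illustration of that point (a sketch only, not code from this
thread): the same word count can run on a laptop with the master set to
"local[*]" and on a cluster by passing a different master URL. The function
names below are made up for the example, and the PySpark import sits inside
the function so the file still loads on a machine without PySpark.

```python
from collections import Counter
from operator import add


def tokenize(line):
    """Split a line into lowercase, whitespace-separated words."""
    return line.lower().split()


def count_words_local(lines):
    """Single-process baseline: fine while the data fits on one machine."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def count_words_spark(lines, master="local[*]"):
    """Same count through PySpark; `master` selects laptop or cluster."""
    # Imported here so this module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master(master)
        .appName("wordcount-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        pairs = (
            spark.sparkContext.parallelize(lines)
            .flatMap(tokenize)            # one record per word
            .map(lambda word: (word, 1))  # (word, 1) pairs
            .reduceByKey(add)             # sum the counts per word
        )
        return dict(pairs.collect())
    finally:
        spark.stop()
```

On a laptop, count_words_spark(lines) uses every local core. Pointing it at a
cluster would mean passing that cluster's master URL instead (whatever your
site uses), with the rest of the code unchanged. For small inputs the plain
count_words_local version will likely finish first, since it pays no
start-up cost.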
> I assume anyone reading a post on
> this list, like me, suffers from "cluster bias" which causes us to forget
> that the bulk of computational work taking place in the world happens
> outside, far outside, of the top 100 machines. And much of that work is
> done by people who care more about the total time to solution and will
> happily trade a little additional CPU time for a better, easier and more
> powerful abstraction to use to ask questions.
> Consider also the increasing amount of this work being done by training a
> deep learning framework after which the researcher may or may not be able
> to explain how/why the thing works. Port that to C :)
> In general I always get suspicious at any suggestion of a pure approach to
> anything. As with "centralization" efforts in the world of IT, a pure
> approach is often code for "arbitrary boundaries for you that we are
> comfortable with." Hadoop/spark are great data exploration tools and
> someone who understands their data and knows Python can do wonderful things
> in a Python notebook backed by an appropriately sized spark cluster and
> then be off to the next question before "hello world" can be compiled in
> I for one welcome our new big data overlords, unless they demand to run
> Excel on the cluster.
> Good thing it's close to my bedtime, I have exhausted my daily buzzword
> On Fri, Dec 30, 2016, 11:00 AM Jonathan Aquilina <jaquilina at eagleeyet.net>
>> Thanks John for your reply. Very interesting food for thought here. What
>> do understand between hadoop and spark is that spark is intended, i
>> be wrong here, as a replacement to hadoop as it performs better and
>> than hadoop.
>> Is spark also java based? I never thought java to be so high performant.
>> know when i started learning to program in java (java6) it was slow and
>> clunky. Wouldn't it be better to stick with a pure beowulf cluster and
>> your apps in c or c++ something that is closer to the machine language
>> the use of an interpreted language such as java? I think where I fall
>> to understand is how with hadoop and spark have they made java so quick
>> compared to a compiled language.
>> On 2016-12-30 08:47, John Hanks wrote:
>> This often gets presented as an either/or proposition and it's really
>> We happily use SLURM to schedule the setup, run and teardown of spark
>> clusters. At the end of the day it's all software, even the kernel and
>> The big secret of HPC is that in a job scheduler we have an amazingly
>> powerful tool to manage resources. Once you are scheduling spark
>> hadoop clusters, VMs as jobs, containers, long running web services,
>> you begin to feel sorry for those poor "cloud" people trapped in
>> But, directly to your question what we are learning as we dive deeper
>> spark (interest in hadoop here seems to be minimal and fading) is that
>> is just as hard or maybe harder to tune for than MPI and the people who
>> want to use it tend to have a far looser grasp of how to tune it than
>> using MPI. In the short term I think it is beneficial as a sysadmin to
>> spend some time learning the inner squishy bits to compensate for that.
>> simple wordcount example or search can show that wc and grep can often
>> outperform spark and it takes some experience to understand when a
>> particular approach is the better one for a given problem. (Where better
>> measured by efficiency, not by the number of cool new technical toys
>> employed :)
>> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina
>> <jaquilina at eagleeyet.net>
>> Hi All,
>> Seeing the new activity about new clusters for 2017, this sparked a
>> thought in my mind here. Beowulf Cluster vs hadoop/spark
>> In this day and age given that there is the technology with hadoop and
>> spark to crunch large data sets, why build a cluster of pc's instead of
>> something like hadoop/spark?
>> Happy New Year
>> Jonathan Aquilina
>> Owner EagleEyeT
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> '[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.'
>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC