<div dir="auto"><div>I pictured Doug standing in front of a crowd of problems. He shouts "you do not have to follow my methods, you are all individuals". The crowd replies in unison "we are all individuals", then one problem stands and says "I'm not". (Apologies to any non-Monty Python fans.)<div dir="auto"><br></div><div dir="auto"><div dir="auto">There is a theoretical optimum solution for every problem; the point is that there's no sense letting pursuit of that solution block the use of a good heuristic that gets things moving now. I think we are both saying the same thing, me with tongue-in-cheek cynicism and sarcasm and you with applied knowledge and information.</div><div dir="auto"><br></div><div dir="auto">However, I do tend to interpret the "big" in "big data" to mean the impact and outcomes downstream of the analytics rather than the size of the inputs. It's the age-old argument about whether it's the size that matters or how well you use it. Perhaps I have a natural bias toward how you use it because I work with a small dataset.</div><div dir="auto"><br></div><div dir="auto">jbh</div><div dir="auto"><br></div></div><br><div class="gmail_extra"><br><div class="gmail_quote">On Dec 30, 2016 11:41 PM, "Douglas Eadline" <<a href="mailto:deadline@eadline.org">deadline@eadline.org</a>> wrote:<br type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="quoted-text"><br>
> I suspect that you can take any hadoop/spark application and give it to a<br>
> good C/C++/OpenMp/MPI coder and in six months, a year, two years,..., you<br>
> will end up with a much faster and much more efficient application.<br>
> Meanwhile the original question the application was answering very likely<br>
> won't matter to those who originally used hadoop/spark to answer it.<br>
<br>
</div>There are some problems that may merit programmer costs of that<br>
magnitude, and others that can tolerate a little inefficiency<br>
because the application works and scales now.<br>
<div class="quoted-text"><br>
><br>
> It's worth keeping in mind that a lot (maybe most) of the "big data"<br>
> analysis being done is done in Microsoft Excel.<br>
<br>
</div>Marketing aside, they call it big data for a reason.<br>
Performing data analytics on a laptop is perfectly feasible,<br>
and a good way to get a feel for your data. If you need to<br>
turn your app loose on Tbytes of real data generated each day<br>
then you may want to use something a bit heftier.<br>
<div class="quoted-text"><br>
> Python and R cover a big<br>
> chunk of it, often running on laptops.<br>
<br>
</div>PySpark runs on a laptop and if you data grows to where you need<br>
to scale, PySpark scales.<br>
<br>
--<br>
Doug<br>
<div class="elided-text"><br>
> I assume anyone reading a post on<br>
> this list, like me, suffers from "cluster bias" which causes us to forget<br>
> that the bulk of computational work taking place in the world happens<br>
> outside, far outside, of the top 100 machines. And much of that work is<br>
> done by people who care more about the total time to solution and will<br>
> happily trade a little additional CPU time for a better, easier and more<br>
> powerful abstraction to use to ask questions.<br>
><br>
> Consider also the increasing amount of this work being done by training a<br>
> deep learning framework after which the researcher may or may not be able<br>
> to explain how/why the thing works. Port that to C :)<br>
><br>
> In general I always get suspicious at any suggestion of a pure approach to<br>
> anything. As with "centralization" efforts in the world of IT, a pure<br>
> approach is often code for "arbitrary boundaries for you that we are<br>
> comfortable with." Hadoop/spark are great data exploration tools and<br>
> someone who understands their data and knows Python can do wonderful<br>
> things<br>
> in a Python notebook backed by an appropriately sized spark cluster and<br>
> then be off to the next question before "hello world" can be compiled in<br>
> C.<br>
> I for one welcome our new big data overlords, unless they demand to run<br>
> Excel on the cluster.<br>
><br>
> Good thing it's close to my bedtime, I have exhausted my daily buzzword<br>
> quota.<br>
><br>
> jbh<br>
><br>
> On Fri, Dec 30, 2016, 11:00 AM Jonathan Aquilina <<a href="mailto:jaquilina@eagleeyet.net">jaquilina@eagleeyet.net</a>><br>
> wrote:<br>
><br>
>> Thanks, John, for your reply. Very interesting food for thought here.<br>
>> What I understand about hadoop and spark is that spark is intended, and<br>
>> I could be wrong here, as a replacement for hadoop, as it performs<br>
>> better and faster than hadoop.<br>
>><br>
>> Is spark also java based? I never thought java could be so performant.<br>
>> I know when I started learning to program in java (java6) it was slow<br>
>> and clunky. Wouldn't it be better to stick with a pure beowulf cluster<br>
>> and build your apps in C or C++, something closer to the machine<br>
>> language, than to use an interpreted language such as java? I think<br>
>> where I fall short is in understanding how hadoop and spark have made<br>
>> java so quick compared to a compiled language.<br>
>><br>
>><br>
>><br>
>> On 2016-12-30 08:47, John Hanks wrote:<br>
>><br>
>> This often gets presented as an either/or proposition and it's really<br>
>> not.<br>
>> We happily use SLURM to schedule the setup, run and teardown of spark<br>
>> clusters. At the end of the day it's all software, even the kernel and<br>
>> OS.<br>
>> The big secret of HPC is that in a job scheduler we have an amazingly<br>
>> powerful tool to manage resources. Once you are scheduling spark<br>
>> clusters,<br>
>> hadoop clusters, VMs as jobs, containers, long running web services,<br>
>> ....,<br>
>> you begin to feel sorry for those poor "cloud" people trapped in<br>
>> buzzword<br>
>> land.<br>
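As an editor's sketch of the pattern described above (standing up a Spark cluster as an ordinary scheduled job), assuming a standalone Spark install on shared storage; <code>$SPARK_HOME</code>, <code>my_job.py</code>, the node count, and the port are placeholders, not a tested recipe:<br>
<pre>
```shell
#!/bin/bash
#SBATCH --job-name=spark-on-slurm
#SBATCH --nodes=4
#SBATCH --time=02:00:00

# Setup: run the Spark master on the first node of the allocation and a
# worker on every node, all as ordinary processes inside the job.
MASTER_URL="spark://$(hostname):7077"
"$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.master.Master &
srun "$SPARK_HOME"/bin/spark-class \
    org.apache.spark.deploy.worker.Worker "$MASTER_URL" &
sleep 10   # crude wait for the daemons to come up

# Run the actual application against the ad hoc cluster.
"$SPARK_HOME"/bin/spark-submit --master "$MASTER_URL" my_job.py

# Teardown: when the job script exits, SLURM reclaims the nodes and the
# master and workers die with the allocation.
```
</pre>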
>><br>
>> But, directly to your question what we are learning as we dive deeper<br>
>> into<br>
>> spark (interest in hadoop here seems to be minimal and fading) is that<br>
>> it is just as hard, or maybe harder, to tune than MPI, and the people<br>
>> who want to use it tend to have a far looser grasp of how to tune it<br>
>> than<br>
>> those<br>
>> want to use it tend to have a far looser grasp of how to tune it than<br>
>> those<br>
>> using MPI. In the short term I think it is beneficial as a sysadmin to<br>
>> spend some time learning the inner squishy bits to compensate for that.<br>
>> A<br>
>> simple wordcount example or search can show that wc and grep can often<br>
>> outperform spark and it takes some experience to understand when a<br>
>> particular approach is the better one for a given problem. (Where better<br>
>> is<br>
>> measured by efficiency, not by the number of cool new technical toys<br>
>> that were employed. :)<br>
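To make the single-node baseline point concrete, here is a minimal sketch in plain Python; the sample text is illustrative, and <code>wc</code>/<code>grep</code> themselves would be leaner still:<br>
<pre>
```python
from collections import Counter

# A word count that produces the same answer as the canonical Spark
# example, with no JVM, no shuffle, and no cluster to stand up.
text = "to be or not to be"
counts = Counter(text.split())

print(counts.most_common(2))
```
</pre>
Measuring a baseline like this first tells you whether the data actually needs a cluster, which is the experience the paragraph above is pointing at.<br>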
>><br>
>> jbh<br>
>><br>
>> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina<br>
>> <<a href="mailto:jaquilina@eagleeyet.net">jaquilina@eagleeyet.net</a>><br>
>> wrote:<br>
>><br>
>> Hi All,<br>
>><br>
>> Seeing the new activity about new clusters for 2017, this sparked a<br>
>> thought in my mind here. Beowulf Cluster vs hadoop/spark<br>
>><br>
>> In this day and age given that there is the technology with hadoop and<br>
>> spark to crunch large data sets, why build a cluster of PCs instead of<br>
>> using something like hadoop/spark?<br>
>><br>
>><br>
>><br>
>> Happy New Year<br>
>><br>
>> Jonathan Aquilina<br>
>><br>
>> Owner EagleEyeT<br>
>> ______________________________<wbr>_________________<br>
>> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
>> To change your subscription (digest mode or unsubscribe) visit<br>
>> <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/<wbr>mailman/listinfo/beowulf</a><br>
>><br>
>> --<br>
>> '[A] talent for following the ways of yesterday, is not sufficient to<br>
>> improve the world of today.'<br>
>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC<br>
>><br>
>> --<br>
> ‘[A] talent for following the ways of yesterday, is not sufficient to<br>
> improve the world of today.’<br>
> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC<br>
><br>
</div>> --<br>
> Mailscanner: Clean<br>
<div class="quoted-text">><br>
> ______________________________<wbr>_________________<br>
> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
> To change your subscription (digest mode or unsubscribe) visit<br>
> <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/<wbr>mailman/listinfo/beowulf</a><br>
><br>
<br>
<br>
--<br>
</div>Doug<br>
<font color="#888888"><br>
<br>
</font></blockquote></div><br></div></div></div>