[Beowulf] SC13 wrapup, please post your own

Joe Landman landman at scalableinformatics.com
Sat Nov 23 12:13:40 PST 2013

On 11/23/2013 03:01 PM, Jonathan Dursi wrote:
> On Nov 23, 2013, at 1:40PM, Joe Landman
> <landman at scalableinformatics.com> wrote:
>> That is, we as a community have much to offer the growing big data
>> community.
> I think this is completely true, and somewhat urgent.  The two
> communities have a lot to teach each other.
> The big data community remains incredibly naive about a lot of
> performance/scalability issues - and of course they are, they’ve only
> been at this a few years.  Traditional HPC has a *lot* of hard-won
> knowledge and experience to offer.
> But conversely, where we’ve been naive is the importance of easily
> deployable, scalable, easy-to-develop-for software frameworks, even
> if it initially comes at substantial cost in terms of
> single-processor performance.  If we choose not to learn the lessons
> of rapid growth of tools like Hadoop, we are in trouble as a
> community.


> We’ve talked for years about how hardware is advancing more rapidly
> than software, but not done much about it; now someone has, and it’s
> not us.  As a result, people are already trying to fit very HPCy
> sorts of problems into Hadoopy sorts of frameworks (cf. all the BSP
> stuff in Pregel or Hama) because it’s so much easier to get things
> working, and so much easier to find developers to maintain.  When it
> comes to choosing a direction for a new project, 100x the number of
> developers will always win over single-processor performance, or even
> scaling, because you can then direct enormous amounts of resources to
> fixing performance issues in the underlying frameworks.

I am a huge believer in plug-in-turn-on-walk-away.  That should be all 
there is to configuration.  Cluster distros should be gone.  Not that 
Chef/Puppet are the right way to go (there are many reasons why they 
aren't IMO), but there are some fantastic concepts coming in from the 
cloud side (Docker.io, SmartOS, ...) that we need to collectively leverage.

But likewise, we still don't quite have resilient computation down, 
among other things.  Checkpointing a job is, in many cases, simply not a 
viable option.  Our job schedulers are cool, but designed for a 
different era.

> Jonathan

