[Beowulf] Hadoop

Joe Landman landman at scalableinformatics.com
Sat Dec 27 08:11:20 PST 2008

Jeff Layton wrote:

> BTW - I saw Karen's post about using Java with HadoopFS. Be sure to pay 
> attention to that since getting a good 64-bit Java implementation for 
> Linux is not always easy. There are a few out there (Sun has an early 
> access program to a 64-bit Java) but the reports I've heard are that 
> it's still early.

Yeah, 64-bit Java is sorta-kinda working.  Sun just released a 64-bit 
Java plugin for NPAPI (e.g. firefox/mozilla) oh, only ... 5 years after 
the first RFE.  Not sure how well baked it is; I am playing with it for 
some of our customers.

64-bit Java shouldn't be hard, as Java VMs are supposed to hide details 
of the underlying architecture.  It is a VM.  But at the end of the day, 
there could be (considerable) differences in execution due to object 
size differences ... that is, unless you completely ignore the 
underlying native intrinsic data sizes, your execution could have some 
... er ... unexpected results.
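
To make the size question a bit more concrete, here is a minimal sketch (not from the original thread). The Java language pins the widths of primitives such as int and long regardless of the host. What moves between a 32-bit and a 64-bit JVM is mainly reference size, and with it object footprint and usable heap. The class name DataModelCheck is made up for illustration, and the sun.arch.data.model property is an assumption that holds on Sun-derived JVMs; other JVMs may not define it, hence the "unknown" fallback.

```java
public class DataModelCheck {
    // Primitive widths are fixed by the Java language specification,
    // independent of whether the JVM is a 32-bit or 64-bit build.
    static int intBits() { return Integer.SIZE; }
    static int longBits() { return Long.SIZE; }

    // Assumption: Sun-derived JVMs expose the data model ("32" or "64")
    // through this system property; fall back to "unknown" elsewhere.
    static String dataModel() {
        return System.getProperty("sun.arch.data.model", "unknown");
    }

    public static void main(String[] args) {
        System.out.println("int bits:   " + intBits());
        System.out.println("long bits:  " + longBits());
        System.out.println("data model: " + dataModel());
        // The maximum heap the JVM will attempt to use; this is one of
        // the values that typically differs between 32- and 64-bit builds.
        System.out.println("max heap (bytes): " + Runtime.getRuntime().maxMemory());
    }
}
```

Running this under both a 32-bit and a 64-bit JVM on the same box is a quick way to see which numbers stay put and which do not.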

Which might be why Java 64 is so hard to create.  They work so hard to 
hide the details of the underlying system (OS, CPU, memory, IO, network) 
from you that, in moving to a new ABI, there is so much to change that 
it is ... non-trivial ... to do so.

This said, I hear of Java's use in HPC every now and then.  Some apps 
are interesting in that they leverage some capability of the underlying 
platform, like the Pervasive Software DataRush effort, and allow you to 
hide latency by massively threading their analyses.  But as we have 
noted to many people, I don't see the great unwashed masses/hordes of HPC 
developers rushing to Java, due to its (many) downsides.  There is a real 
tangible measurable performance penalty to abstraction.  Introduce too 
much and you spend more time traversing the abstraction classes than you 
do doing the computation.  Heck, we can't even write good compilers for 
non-OO code (e.g. compilers that generate near-optimal instruction paths 
on existing CPUs for a significant fraction of HPC code bases).  Are we 
expecting to write even better JIT compilers and optimizers to solve a 
more difficult problem than the one we have basically punted on?

I am a huge believer in programmer productivity (though I dispute the 
notion that Java's incredibly draconian type system coupled with its 
verbosity actually contributes to productivity), but underlying code 
performance is still one of the most important aspects in HPC.  DataRush 
solves this by hiding latency of each task, by having so many tasks to 
work on.  Sort of a Java version (weak analogy) of the old Tera MTA 
system.  Other codes like Hadoop could do similar things ... schedule so 
much work that some actually gets done.
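
A minimal sketch of that latency-hiding idea, assuming nothing about how DataRush or Hadoop actually implement it: each task stalls for a while (Thread.sleep stands in for memory, IO or network latency) and then does a trivial computation. The class name LatencyHiding, the helper names and the 25 ms stall are all made up for illustration. With one thread the stalls add up; with as many threads as tasks, the stalls overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LatencyHiding {
    // Hypothetical task: a stall (standing in for memory/IO/network
    // latency) followed by a trivial amount of computation.
    static long stallThenCompute(int id, long stallMillis) throws InterruptedException {
        Thread.sleep(stallMillis);
        return (long) id * id;
    }

    // Runs `tasks` tasks on a fixed pool of `threads` threads and
    // returns the elapsed wall-clock time in milliseconds.
    static long runMillis(int tasks, int threads, final long stallMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<Future<Long>>();
            long start = System.nanoTime();
            for (int i = 0; i < tasks; i++) {
                final int id = i;
                Callable<Long> job = () -> stallThenCompute(id, stallMillis);
                futures.add(pool.submit(job));
            }
            for (Future<Long> f : futures) {
                f.get(); // wait for every task to finish
            }
            return (System.nanoTime() - start) / 1000000L;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("1 thread:   " + runMillis(16, 1, 25) + " ms");
        System.out.println("16 threads: " + runMillis(16, 16, 25) + " ms");
    }
}
```

Note what this does and does not show: total throughput improves because the waits overlap, but no individual task gets any faster, which is exactly the trade-off at issue for HPC codes.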

A nascent (yet very real) problem for Java in HPC usage going forward, in 
addition to those mentioned above, is its complete lack of 
support for accelerators.  Maybe someday, in another decade or so, they 
will start to support GPU computation ...  not talking about OpenCL 
support, but real execution on the many cores that accelerators supply. 
  The underlying architecture is changing fast enough that I don't think 
they can keep up.  And end users want the performance.  This provides a 
net incentive not to use Java, as it can't currently (or in the 
foreseeable future) support the emerging personal supercomputing systems 
with accelerators.  Sure it can run on the CPUs, but then like all other 
codes, it runs head first into the memory wall, the IO bandwidth walls, 
and so forth.

Sun of course will claim that the trick is to massively multithread the 
code, which means you don't focus on individual thread performance but 
on overall throughput.  Which somewhat flies in the face of what HPC 
developers have been talking about for decades (tune for single 
processors first, then for parallel).

So I won't disparage the users or use of Java in HPC, other than to note 
that the future on that platform in HPC may not be as bright as some 
marketeers might suggest.

N.B. the recent MPI class we gave suggested that we need to re-tool it 
to focus more upon Fortran than C.  There was no interest in Java from 
the class I polled.  Some researchers want to use Matlab for their work, 
but most university computing facilities are loath to spend the money 
to get site licenses for Matlab.  Unfortunate, as Matlab is a very cool 
tool (I first played with it in 1988 ...); it's just not fast.  The 
folks at Interactive Supercomputing might be able to help with this with 
their compiler.

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
