[Beowulf] Hadoop
Joe Landman
landman at scalableinformatics.com
Sat Dec 27 08:11:20 PST 2008
Jeff Layton wrote:
> BTW - I saw Karen's post about using Java with HadoopFS. Be sure to pay
> attention to that since getting a good 64-bit Java implementation for
> Linux is not always easy. There are a few out there (Sun has an early
> access program to a 64-bit Java) but the reports I've heard are that
> it's still early.
Yeah, 64-bit Java is sorta-kinda working. Sun just released a 64-bit
Java browser plugin (NPAPI, i.e. Firefox/Mozilla) oh, only ... 5 years
after the first RFE. Not sure how well baked it is; I am playing with
it for some of our customers.
64-bit Java shouldn't be hard, as Java VMs are supposed to hide the
details of the underlying architecture. It is a VM. But at the end of
the day, there could be (considerable) differences in execution due to
object size differences ... that is, unless you completely ignore the
underlying native intrinsic data sizes, your execution could have some
... er ... unexpected results.
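To make that concrete, here is a tiny probe I threw together (my code,
nothing official from Sun). It assumes a Sun/HotSpot JVM, which sets
the sun.arch.data.model property; the bytes-per-reference estimate is
deliberately crude (no GC control, no accounting for compressed oops):

public class DataModelProbe {
    public static void main(String[] args) {
        // "32" or "64" on Sun JVMs; other vendors may not set this.
        System.out.println("data model: "
                + System.getProperty("sun.arch.data.model"));
        System.out.println("os.arch:    "
                + System.getProperty("os.arch"));

        // Crude footprint estimate: on a 64-bit VM references grow
        // from 4 to 8 bytes, so reference-heavy structures get
        // bigger, caches hold fewer of them, and the timing of
        // "identical" code shifts.
        Runtime rt = Runtime.getRuntime();
        long before = rt.totalMemory() - rt.freeMemory();
        Object[] refs = new Object[1 << 20];    // ~1M null references
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println("approx bytes/reference: "
                + (double) (after - before) / refs.length);
    }
}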
Which might be why Java 64 is so hard to create. They work so hard to
hide the details of the underlying system (OS, CPU, memory, IO, network)
from you that, in moving to a new ABI, there is so much to change that
it is ... non-trivial ... to do.
This said, I hear of Java's use in HPC every now and then. Some apps
are interesting in that they leverage some capability of the underlying
platform, like the Pervasive Software DataRush effort, and let you hide
latency by massively threading your analyses. But as we have noted to
many people, I don't see the great unwashed masses/hordes of HPC
developers rushing to Java, due to its (many) downsides. There is a
real, tangible, measurable performance penalty to abstraction.
Introduce too much and you spend more time traversing the abstraction
classes than you do doing the computation. Heck, we can't even write
good compilers for non-OO code (e.g. compilers that generate
near-optimal instruction paths on existing CPUs for a significant
fraction of HPC code bases). Are we expecting to write even better JIT
compilers and optimizers, to solve a more difficult problem than the
one we have basically punted on?
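As a toy illustration of that traversal cost (my own throwaway
microbenchmark, with the usual caveats: no JIT warmup, naive timing),
compare a straight loop over a primitive array with the same sum
pushed through boxing and an iterator:

import java.util.ArrayList;
import java.util.List;

public class AbstractionCost {
    static final int N = 10000000;

    public static void main(String[] args) {
        // The raw data: a flat primitive array the JIT can chew through.
        double[] raw = new double[N];
        for (int i = 0; i < N; i++) raw[i] = i;

        // The same data behind two layers of abstraction: boxed
        // Doubles in a List, walked via an Iterator.
        List<Double> boxed = new ArrayList<Double>(N);
        for (int i = 0; i < N; i++) boxed.add((double) i);

        long t0 = System.nanoTime();
        double s1 = 0;
        for (int i = 0; i < N; i++) s1 += raw[i];
        long t1 = System.nanoTime();

        double s2 = 0;
        for (Double d : boxed) s2 += d;   // iterator + unbox, per element
        long t2 = System.nanoTime();

        System.out.printf("primitive: %8.1f ms (sum %.0f)%n",
                (t1 - t0) / 1e6, s1);
        System.out.printf("boxed:     %8.1f ms (sum %.0f)%n",
                (t2 - t1) / 1e6, s2);
    }
}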
I am a huge believer in programmer productivity (though I dispute the
notion that Java's incredibly draconian type system, coupled with its
verbosity, actually contributes to productivity), but underlying code
performance is still one of the most important aspects of HPC.
DataRush addresses this by hiding the latency of each task behind the
sheer number of tasks available to work on. Sort of a Java version
(weak analogy) of the old Tera MTA system. Other codes like Hadoop
could do similar things ... schedule so much work that some of it
actually gets done.
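A minimal sketch of that scheduling idea in plain java.util.concurrent
terms (my illustration of the general technique, not DataRush's or
Hadoop's actual machinery): oversubscribe a pool so that whenever one
task stalls, another is ready to keep the cores busy.

import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LatencyHiding {
    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        // More threads than cores: tasks stalled in the fake I/O
        // sleep below leave the cores free for tasks that are ready.
        ExecutorService pool = Executors.newFixedThreadPool(4 * cores);
        CompletionService<Double> done =
                new ExecutorCompletionService<Double>(pool);

        int tasks = 32 * cores;   // schedule far more work than cores
        for (int t = 0; t < tasks; t++) {
            done.submit(new Callable<Double>() {
                public Double call() throws Exception {
                    Thread.sleep(50);            // pretend to wait on I/O
                    double s = 0;                // then actually compute
                    for (int i = 0; i < 1000000; i++) s += Math.sqrt(i);
                    return s;
                }
            });
        }

        double total = 0;
        for (int t = 0; t < tasks; t++) total += done.take().get();
        pool.shutdown();
        System.out.println("total = " + total);
    }
}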
A nascent (yet very real) problem for Java in HPC, in addition to the
above, is its complete lack of support for accelerators. Maybe
someday, in another decade or so, it will start to support GPU
computation ... I'm not talking about OpenCL support, but real
execution on the many cores that accelerators supply. The underlying
architecture is changing fast enough that I don't think they can keep
up. And end users want the performance. This provides a net incentive
not to use Java, as it can't currently (or in the foreseeable future)
support the emerging personal supercomputing systems with accelerators.
Sure, it can run on the CPUs, but then, like all other codes, it runs
head first into the memory wall, the IO bandwidth wall, and so forth.
Sun, of course, will claim that the trick is to massively multithread
the code, which means you don't focus on individual thread performance
but on overall throughput. Which somewhat flies in the face of what
HPC developers have been saying for decades (tune for a single
processor first, then for parallel).
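For what it's worth, that decades-old advice looks like this in
miniature (again my own sketch, not anyone's production code): make
the serial kernel tight and allocation-free first, and only then carve
it up across threads.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TuneThenParallelize {
    // Step 1: tune the serial kernel (tight loop, primitives only).
    static double dot(double[] a, double[] b, int lo, int hi) {
        double s = 0;
        for (int i = lo; i < hi; i++) s += a[i] * b[i];
        return s;
    }

    // Step 2: only once the kernel is fast, split it across cores.
    public static void main(String[] args) throws Exception {
        final int n = 1 << 24;
        final double[] a = new double[n];
        final double[] b = new double[n];
        Arrays.fill(a, 1.0);
        Arrays.fill(b, 2.0);

        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        List<Future<Double>> parts = new ArrayList<Future<Double>>();
        int chunk = n / cores;
        for (int c = 0; c < cores; c++) {
            final int lo = c * chunk;
            final int hi = (c == cores - 1) ? n : lo + chunk;
            parts.add(pool.submit(new Callable<Double>() {
                public Double call() { return dot(a, b, lo, hi); }
            }));
        }
        double sum = 0;
        for (Future<Double> f : parts) sum += f.get();
        pool.shutdown();
        System.out.println("dot = " + sum);   // expect 2.0 * n
    }
}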
So I won't disparage the users or use of Java in HPC, other than to
note that the future of that platform in HPC may not be as bright as
some marketeers might suggest.
N.B. the recent MPI class we gave suggested that we need to re-tool it
to focus more upon Fortran than C. There was no interest in Java from
the class I polled. Some researchers want to use Matlab for their
work, but most university computing facilities are loath to spend the
money on site licenses for Matlab. Unfortunate, as Matlab is a very
cool tool (I've been playing with it since 1988 ...); it's just not
fast. The folks at Interactive Supercomputing might be able to help
here with their compiler.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615