Neural Network applications using Beowulf

Robert G. Brown rgb at phy.duke.edu
Wed Nov 20 06:42:29 PST 2002


On Thu, 14 Nov 2002, Eray Ozkural wrote:

> On Tuesday 12 November 2002 06:24, Robert G. Brown wrote:
> > I actually think that there is room to do a whole lot of interesting
> > research on this in the realm of Real Computer Science.
> >
> > Too bad I'm a physicist...;-)
> 
> Note that most artificial neural network applications don't fall in the realm 
> of supercomputing since they would be best suited to hardware 
> implementations, or more commonly, serial software.
> 
> We had discussed this with colleagues back at bilkent cs department and we 
> could not find great research opportunities in this area. It is a little 
> similar to stuff like parallel DFA/NFA systems. You first need an application 
> to prove that there is need for problems of that magnitude (more than what a 
> serial computer could solve!). What good is a supercomputer for an artificial 
> neural network that is comprised of just 20 nodes?
> 
> If of course somebody showed an application that did demand the power of a 
> supercomputer it would be very different, then we would get all of our 
> combinatorial tools to partition the computational space and parallelize 
> whatever algorithm there is :)
> 
> Neural networks being Turing-complete, I assume such a network would bear an 
> arrangement radically different from the "multi-layer feed-forward" networks 
> that EE people seem to be obsessed with. I have lost my interest in that area 
> since they don't seem to demand parallel systems and they are not 
> biologically plausible.

Hmm, I should probably take this offline, but I'll risk one more round
of replies on list since the topic does come up from time to time and is
not, I think, devoid of interest:-)

Let me itemize a few points:

  a) NN's (both MLFFBP NNs and their more exotic cousins) are extremely
useful in lots of problems.  As generalized multivariate nonlinear
function approximators, they can do things that are remarkably difficult
to do in other bases, such as infer/fit high dimensional (probability)
distribution functions from noisy data where the distribution(s) have
highly nontrivial high dimensional correlations.  In fact, for some of
these problems they are one of the only games in town -- one can
sometimes do better with specific problems if one has a Statistics
Ph.D., armed with advanced tools, study the problem for a year and try
to solve it, but NNs tend to produce comparable results with a lot less
expertise because they don't care about things like covariance in the
inputs.
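
(To make (a) concrete: below is a toy sketch in Python/numpy -- the
sizes, names, and target function are all invented for illustration --
of a one-hidden-layer net fit by plain gradient descent to noisy samples
of a nonlinear function of correlated inputs.  Nothing is tuned; it is
only meant to show operationally what "generalized multivariate
nonlinear function approximator" means.)

  import numpy as np

  rng = np.random.default_rng(0)

  # Correlated 3-d inputs: the net does not care about the covariance.
  cov = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
  X = rng.multivariate_normal(np.zeros(3), cov, size=2000)
  # Nonlinear target plus noise.
  y = np.sin(X[:, 0] * X[:, 1]) + 0.5 * X[:, 2]**2 + 0.1 * rng.standard_normal(2000)

  # One hidden layer of tanh units, linear output.
  H = 20
  W1 = 0.5 * rng.standard_normal((3, H))
  b1 = np.zeros(H)
  W2 = 0.5 * rng.standard_normal(H)
  b2 = 0.0

  def forward(X):
      hidden = np.tanh(X @ W1 + b1)
      return hidden @ W2 + b2, hidden

  lr = 0.01
  for epoch in range(2000):
      pred, hidden = forward(X)
      err = pred - y                          # gradient of 0.5*MSE
      gW2 = hidden.T @ err / len(X)
      gb2 = err.mean()
      gh = np.outer(err, W2) * (1 - hidden**2)
      gW1 = X.T @ gh / len(X)
      gb1 = gh.mean(axis=0)
      W2 -= lr * gW2; b2 -= lr * gb2
      W1 -= lr * gW1; b1 -= lr * gb1

  print("final MSE:", np.mean((forward(X)[0] - y)**2))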

  b) NN's can be worth a lot of money.  Applied to pattern
recognition/predictive modelling problems alone, they can yield really
significant marginal gains over e.g. outer-product logistic regression
(totally abused as the universal hammer for multivariate predictive
models by corporations and health researchers alike -- turn this dial,
sweep up this probability in a smooth S... :-p).

  c) There are two distinct components of the applications of NN's to
real world problems.  Both can be parallelized, but quite differently.

    i) Application of an existing trained net to a large dataset for
e.g.  classification.  This can be parallelized trivially by
distributing the dataset (or partitions thereof) and the net in
question, although NN application is typically so fast that it is hardly
worth it.  More likely useful in an HA situation, where a web site was
attempting real-time classification of lots of incoming threads of data
on a farm of classifier nodes, or in a simple distributed resource
situation, where the network is installed on all the desktop systems to
be applied by hand in parallel by worker bees seeking to classify
particular datasets of interest to them.
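
(A sketch of (i), with Python's multiprocessing on one box standing in
for farming partitions out to cluster nodes; the "trained" weights below
are random placeholders, and all of the names are my own:)

  import numpy as np
  from multiprocessing import Pool

  # Pretend these are the weights of an already-trained classifier.
  rng = np.random.default_rng(1)
  W1, b1 = rng.standard_normal((10, 16)), np.zeros(16)
  W2, b2 = rng.standard_normal((16, 3)), np.zeros(3)

  def classify_chunk(chunk):
      """Apply the fixed, trained net to one partition of the data."""
      hidden = np.tanh(chunk @ W1 + b1)
      scores = hidden @ W2 + b2
      return scores.argmax(axis=1)            # class label per row

  if __name__ == "__main__":
      data = rng.standard_normal((100_000, 10))
      chunks = np.array_split(data, 8)         # one partition per worker/node
      with Pool(8) as pool:
          labels = np.concatenate(pool.map(classify_chunk, chunks))
      print(labels[:10])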

   ii) Creation of a trained network from data.  This is where things
get interesting, as networks are complex systems (in the literal, Santa
Fe Institute sense) and hence highly nontrivial to create, with a
difficulty that scales amazingly poorly with input dimension.

  d) In my direct experience, the ONLY way to get good results in
problems with high input dimensionality is to eschew anything like
"simple" back propagation/regression or gradient descent for network
creation and insert a preliminary genetic algorithm optimization step.
GA's are LIKEWISE complex systems (which routinely and easily become
trapped in "inbred" locally stable optima far from -- and not in the
same valley as -- the real optima).  
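
(A deliberately dumb toy of the GA front end, in Python/numpy.  A tiny
net and dataset stand in for the real thing; the population size,
one-point crossover, and mutation rate are illustrative placeholders,
not recommendations.  A "finishing" step for the winner is sketched a
bit further down.)

  import numpy as np

  rng = np.random.default_rng(2)

  # Toy data: noisy nonlinear target (stand-in for a hard, high-d fit).
  X = rng.standard_normal((500, 5))
  y = np.sin(X[:, 0]) * X[:, 1] + 0.1 * rng.standard_normal(500)

  H = 8                                   # hidden units
  NPAR = 5 * H + H + H + 1                # W1, b1, W2, b2 flattened

  def unpack(w):
      i = 0
      W1 = w[i:i + 5 * H].reshape(5, H); i += 5 * H
      b1 = w[i:i + H]; i += H
      W2 = w[i:i + H]; i += H
      b2 = w[i]
      return W1, b1, W2, b2

  def mse(w):
      W1, b1, W2, b2 = unpack(w)
      pred = np.tanh(X @ W1 + b1) @ W2 + b2
      return np.mean((pred - y) ** 2)

  # Plain generational GA over flattened weight vectors.
  POP, GENS = 60, 200
  pop = rng.standard_normal((POP, NPAR))
  for gen in range(GENS):
      fit = np.array([mse(w) for w in pop])
      elite = pop[np.argsort(fit)[:POP // 4]]      # keep the best quarter
      children = []
      while len(children) < POP - len(elite):
          ma, pa = elite[rng.integers(len(elite), size=2)]
          cut = rng.integers(1, NPAR)              # one-point crossover
          child = np.concatenate([ma[:cut], pa[cut:]])
          child += 0.05 * rng.standard_normal(NPAR)  # mutation
          children.append(child)
      pop = np.vstack([elite] + children)

  best = pop[np.argmin([mse(w) for w in pop])]
  print("GA-seeded MSE:", mse(best))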

FWIW, I have read some things that suggest that at least trivial GA's
(the natural selection part) are used biologically in the
development of real wetware -- used/successful pathways tend to be
selected for and reinforced, unused pathways are eliminated or recycled,
although there may not be crossover and so forth.

So when I talk about NN's being interesting, I mean specifically
training/creating NNs for problems with high, nontrivial input
dimensionality, where "high" means literally "as high as we can
currently afford to compute at all" (an algorithm dependent,
implementation dependent, Moore's Law dependent limit), fronted by a GA
that is ITSELF generally the rate limiting step (although conjugate
gradient "finishing" of the best network found with the GA can take a
VERY long time, as the valleys one is descending tend to be shallow,
tortuous in many dimensions, and long).  That is an interesting problem
where clever parallelization could conceivably lead to real advances,
both in the largest problems we can tackle at all AND in the time
required to solve problems that aren't this big but where an answer is
valuable in as little time as possible.
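
(The "finishing" step might look like handing the GA's best weight
vector to a nonlinear conjugate gradient minimizer and letting it crawl
down the valley.  To keep this sketch self-contained, a made-up smooth
objective and a zero vector stand in for the real network error and the
GA winner, and scipy is left to estimate the gradient numerically, which
one would not do for a big net:)

  import numpy as np
  from scipy.optimize import minimize

  # Stand-in objective: in real life this would be the network's training
  # error as a function of the flattened weight vector (the mse() of the
  # GA sketch above), and w0 would be the best individual the GA found.
  def objective(w):
      return np.sum((w - np.arange(len(w))) ** 2) + np.sum(np.sin(w))

  w0 = np.zeros(50)                      # pretend: the GA's best individual
  result = minimize(objective, w0, method="CG", options={"maxiter": 500})
  print("finished value:", result.fun, "converged:", result.success)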

Parallelizing the GA is one area where I think really interesting things
can be done.  The (general/simple) GA suffers from a number of known
problems -- the sorting process is limited to "gene" values available in
the initial pool plus whatever you can introduce by mutation and local
gradient methodologies without destabilizing the algorithm altogether;
with small populations it converges rapidly to an inbred population that
is pretty much the best it can do, subject to the aforementioned
mutational or gradient descent drift; with large populations it takes a
long time to converge at ALL; plus there is a slew of technical details
associated with the
optimal ordering of genes in a linear vector, with various crossover
schema, even with the optimal number of sexes.

Nature, of course, does genetic optimization in stochastic parallel all
the time, so GA's are inherently parallelizable, and developing them IN
a parallel environment provides one with the opportunity to see if
algorithms that are "only" efficient in such an environment, with a full
degree more of algorithmic complexity than a serial GA, can break
through some of the limitations of simple GAs and extend the
aforementioned limits in useful and valuable ways.  This would benefit
NNs as well as many other fields that derive useful local optima from a
GA.
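
(One of the simplest such algorithms is the island model: independent
subpopulations evolving in parallel that occasionally exchange migrants,
which directly attacks the inbreeding problem above.  A toy sketch, with
multiprocessing on one box standing in for Beowulf nodes and a trivial
quadratic objective standing in for a network's training error; the
island count, migration scheme, and everything else are illustrative
guesses:)

  import numpy as np
  from multiprocessing import Pool

  rng = np.random.default_rng(3)

  def fitness(w):
      """Stand-in objective (e.g. a net's training error); lower is better."""
      return np.sum(w ** 2)

  def evolve_island(args):
      """Run one island's GA for a few generations, return its population."""
      pop, seed = args
      local = np.random.default_rng(seed)
      for _ in range(20):
          fit = np.array([fitness(w) for w in pop])
          elite = pop[np.argsort(fit)[:len(pop) // 4]]
          kids = elite[local.integers(len(elite), size=len(pop) - len(elite))]
          kids = kids + 0.05 * local.standard_normal(kids.shape)
          pop = np.vstack([elite, kids])
      return pop

  if __name__ == "__main__":
      NISLANDS, POP, NPAR = 4, 40, 30
      islands = [rng.standard_normal((POP, NPAR)) for _ in range(NISLANDS)]
      with Pool(NISLANDS) as pool:
          for epoch in range(10):
              args = [(isl, 1000 * epoch + i) for i, isl in enumerate(islands)]
              islands = pool.map(evolve_island, args)
              # Migration: each island's best individual replaces the next
              # island's worst, which keeps the islands from becoming
              # isolated, inbred populations.
              bests = [isl[np.argmin([fitness(w) for w in isl])].copy()
                       for isl in islands]
              for i, isl in enumerate(islands):
                  worst = np.argmax([fitness(w) for w in isl])
                  isl[worst] = bests[(i - 1) % NISLANDS]
      print("best fitness:", min(fitness(w) for isl in islands for w in isl))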

Finally, there are still SOME questions associated with high dimensional
NNs (including mere classifiers, but certainly extending toward more
interesting constructs that simulate "intelligence") that literally
cannot be sensibly answered -- yet -- because of their complexity and
the time required to build them.  I think that it is very much
worthwhile to meditate upon these questions, as well, while mucking
about with parallel GAs per se.  At the very least, it might be
worthwhile to consider the ratio of computation to communication for
various arrangements of e.g. parallelized layer serialization.  If one
thinks (not unreasonably) that real brains both do lots of things in
parallel AND have all sorts of feedback loops, it could easily be
worthwhile to implement models in "hardware" with task/layer specific
nodes even at the expense of serial implementation speed, especially
when faced with the daunting task of simultaneously training all the
"independent" layer nodes with inputs that are also varying as the
preceding layers are being optimized.  
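
(One way to start: a back-of-envelope computation/communication ratio
for a one-layer-per-node pipeline.  Every number below is pulled out of
the air for illustration; plug in your own layer widths and wires:)

  # A matrix-multiply-dominated layer on one node, shipping its
  # activations to the next node each pipeline step.
  n_in, n_out  = 1000, 1000              # units feeding / in this layer
  batch        = 64                      # patterns per pipeline step
  flops        = 2 * n_in * n_out * batch          # one matrix multiply
  bytes_sent   = 8 * n_out * batch                 # doubles to next node
  flop_rate    = 1e9                     # ~1 GFLOP/s CPU
  wire_rate    = 100e6 / 8               # 100 Mbps ethernet, in bytes/s

  compute_time = flops / flop_rate
  comm_time    = bytes_sent / wire_rate
  print("compute %.4fs  communicate %.4fs  ratio %.1f"
        % (compute_time, comm_time, compute_time / comm_time))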

I don't think it is quite time to say that we know all here and that
there is nothing "interesting" left to do -- I think that it is truer to
say that we've discovered the "transistor" -- a simple NN in fact can be
a very good analog of a transistor and could even doubtless be trained
to precisely emulate a transistor's I/O characteristics:-) -- but are
totally clueless about how to wire lots of transistors together to get
useful work done on complex tasks.  

The CPU on which I'm typing this reply IS isomorphic to a particular NN.
It is an NN that so far we have proven utterly incapable of "growing" --
it has to be microscopically designed and is built up of concatenations
of individually designed functional components (where we cannot yet
"grow" a NN that replicates any but the simplest of the components).
And, as you've noted, the in-parallel, genetically optimized NN that I'm
using to DIRECT my fingers as they type was grown, and yet is (in
principle, not practice:-) capable of designing the CPU.

Surely it might be capable of improving and parallelizing algorithms we
might use to build NN's and organize them so that they can accomplish
complex tasks, without having to basically "design the CPU" by hand...

Then there is the issue of time and dynamic networks.  "Most" NN's in
application are utterly, boringly, Markovian in both their inputs and
their processing.  Yet in reality, nearly anything of interest (such as
the good old CPU and my wetware) is horribly, overwhelmingly
non-Markovian in operation, with input and state "memory", both
internal and external, that totally alters the output across multiple
applications of the "same" local time input.  There seems to me to be an
inherently parallel, or maybe loop-serial character to the training
process of non-Markovian networks, where one may have a non-Markovian
sequence in the training cycle that can be performed in serialized
parallel (node A optimizes a layer passed to node B, which optimizes a
layer passed to node C..., where B, C ... have outputs that are fed back
to node A for the next full pass of optimization, all in parallel).
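
(A purely serial toy of that schedule, with crude accept-if-better
random perturbation standing in for whatever per-layer optimizer one
would really use.  In a real implementation each pass of the inner loop
over k would live on its own node, with the layer state shipped around
the ring; here the point is only the who-optimizes-what-when:)

  import numpy as np

  rng = np.random.default_rng(4)

  # Toy data and a three-layer net; imagine layer k's optimization living
  # on its own node (A, B, C), with only weights/activations passed on.
  X = rng.standard_normal((400, 6))
  y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]

  sizes = [(6, 12), (12, 12), (12, 1)]
  layers = [0.3 * rng.standard_normal(s) for s in sizes]

  def forward(layers, X):
      h = X
      for k, W in enumerate(layers):
          h = h @ W
          if k < len(layers) - 1:
              h = np.tanh(h)
      return h[:, 0]

  def loss(layers):
      return np.mean((forward(layers, X) - y) ** 2)

  # Loop-serial schedule: "node" k tunes only its own layer against the
  # current, frozen state of the others, then hands the result around
  # the ring; a full pass feeds back to node A and the cycle repeats.
  for full_pass in range(30):
      for k in range(len(layers)):             # node A, then B, then C
          for _ in range(20):                   # a few crude local steps
              trial = layers[k] + 0.02 * rng.standard_normal(layers[k].shape)
              candidate = list(layers)
              candidate[k] = trial
              if loss(candidate) < loss(layers):  # keep improvements only
                  layers = candidate
  print("final loss:", loss(layers))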

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
