OT: computationally hard or not (algorithm question)

Frank Joerdens frank at joerdens.de
Sat Mar 24 11:17:23 PST 2001


On Fri, Mar 23, 2001 at 06:00:07PM -0500, Mark Hahn wrote:
[ . . . ]
> > Now, take any one of those dots, search for the 10 (some figure on
> > this order of magnitude) dots which are closest to it, and order
> > those by
>
> why 10?

Something along those lines. The result set should be very easily
"human-parsable": visually, at a glance, more or less.

[ . . . ]
> there's also a large body of literature on analyzing similarity in
> data with many dimensions, each of which is 26 deep, and in analyzing
> data with many, many dimensions, each of which is 4 deep.
> (text and genetics if you didn't guess ;)

That sounds very interesting indeed! To spill the beans about what I'm
up to (it's a little embarrassing, seeing that you supercomputing people
are mostly into fairly serious stuff: protein sequences, physics, data
mining etc.), it's an online community/game where I want to give users
the option to find avatars "similar" to theirs. Similarity is a
massively complex notion, psychologically, and ideally this would be a
problem you'd attack with some AI strategy (expert systems, or systems
that "learn" similarity), but the process needs to be _fast_: if a user
comes to the site, outfits her/his avatar with attributes and then hits
the go button to look for potential buddies, something around a second
would be acceptable. Bear in mind that there might be dozens of those
queries running more or less simultaneously, or in quick succession.
Hence my thinking that you'd define a set of properties against which
every attribute is rated. Consider clothing: every item would get a
rating for coolness, elegance and sportiness (we played that through;
three categories would be _way_ too narrow, you'd need quite a few more
differentiated ones), which means you're looking at a 3D space where
each dimension is only 3 values deep (2 = very, 1 = kind of, 0 = not at
all). Actually, I am now thinking that I could do with a depth of only
3 or 4, but that I'd still need around a dozen dimensions (we haven't
fixed the property set yet).
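
To make that concrete, here is a rough C sketch of the encoding I have
in mind (the names, the constants and the city-block distance are just
assumptions for illustration; none of this is fixed yet):

#include <stdlib.h>

#define NPROPS 12    /* around a dozen property dimensions */
#define DEPTH   4    /* each rating lies in 0..DEPTH-1 */

typedef struct {
    int rating[NPROPS];  /* e.g. 0 = not at all, 1 = kind of, 2 = very */
} avatar;

/* City-block (L1) distance between two avatars; with ratings this
   coarse it seems as defensible as Euclidean and needs no sqrt(). */
int avatar_dist(const avatar *a, const avatar *b)
{
    int i, d = 0;
    for (i = 0; i < NPROPS; i++)
        d += abs(a->rating[i] - b->rating[i]);
    return d;
}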

Do you by any chance have a starting point from which to dive into this
body of literature?

> > dimensions if it turns out that I can't afford the hardware) when I want
> > the result set within a couple of seconds at the _very_ most?
> 
> the enclosed program runs in around 7 seconds on my cheap duron/600,
> and is a fairly obvious implementation of what you described.
> it's also O(N^2) in the number of elements, but linear in the 
> length of each string.  the next obvious step would be to use a 
> kD tree, or something a little smarter than the quadratic rescanning
> of the array.

I don't read C too well, but as far as I could make out, the program
you enclosed (great!) implements Robert's first suggestion:

-------------------------- begin quote -------------------------- 
to pick an element (or a point), evaluate the distance of each element
in the elements from the element or point (storing the result in e.g.
distance[i]) and sort the indices of distance (not the list itself).
-------------------------- end quote -------------------------- 

I'll try that!
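
Concretely, I picture something like the sketch below, building on the
avatar struct above (the sort-the-indices trick is my reading of what
Robert meant, and the file-scope dist[] array is just the easiest way
around qsort() having no context argument):

#include <stdlib.h>

#define NELEMS 100000  /* number of avatars, for the sake of argument */

static int dist[NELEMS];  /* distance[i] from the quote above */

static int by_distance(const void *pa, const void *pb)
{
    int a = *(const int *)pa, b = *(const int *)pb;
    return dist[a] - dist[b];  /* distances are small, no overflow */
}

/* Sort the indices 0..n-1 by distance from the query avatar q; on
   return idx[0..k-1] hold the k nearest.  idx must hold n ints and
   n must not exceed NELEMS.  This is the brute-force scan Mark
   mentions, O(n log n) per query; a kD tree would be the smarter
   next step. */
void nearest_indices(const avatar *elems, int n, const avatar *q,
                     int *idx)
{
    int i;
    for (i = 0; i < n; i++) {
        dist[i] = avatar_dist(q, &elems[i]);
        idx[i] = i;
    }
    qsort(idx, n, sizeof idx[0], by_distance);
}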

Many thanks,
Frank



