[Biopython-dev] Added NN search

Thomas Hamelryck thamelry at vub.ac.be
Thu Jul 18 06:02:55 EDT 2002

Hi Andrew,

Thanks for the feedback on the KD tree.

> Is this really the case that it takes 3/4 of a minute to
> do a distance search on 10,000 points?

That was some time ago. I was also surprised that it was so
slow, so I took a good look at my algorithm and discovered that I was
examining waaaay to many tree nodes. It's much faster now. BTW, the 45s
was for finding all point pairs within radius r of each other, not for an
individual query.

Running a program that fills a 1x1x1 box with 100.000 random points and then
finds all point pairs within a radius of 0.01 (typically 20.000 pairs) now
takes 7s (time python __init__.py in the KDTree directory) on my 1 GHz.

> BTW, I had some problems understanding the API.  See the attachment
> where I commented out what doesn't work.

You have used the raw SWIG generated module! There's another layer on top of
that, which does more error checking etc. It should not be possible e.g. to
get segfaults with that module.

In short:

from Bio.Tools.KDTree import KDTree

kdt=KDTree(dim, bucketsize)


# individual queries
kdt.search(center, radius)

# al pairs

The __init__.py file in the KDTree dirctory now contains an example of howto
use the module.

> One of the things people want is to find the all the points in
> set X which are within distance r of any point in set Y.

True. I didn't really think of that.
For X small and Y large you can build a KD tree for Y and query the tree with
the positions in X.
For X and Y large you can lump them together, build a KD tree, do an
all-neighbors search and filter the results. Maybe I can come up with a nicer
solution for that last case.



Thomas Hamelryck      Vrije Universiteit Brussel (VUB)
Intitute for Molecular Biology            ULTR Department
Paardenstraat 65    1640 Sint-Gensius-Rode, Belgium

More information about the Biopython-dev mailing list