[Biopython-dev] Biopython 1.00a2 release

Jeffrey Chang jchang at SMI.Stanford.EDU
Thu Jun 28 14:28:36 EDT 2001


At 9:27 AM +0200 6/28/01, Thomas Sicheritz-Ponten wrote:

>- one observation for calculating the euclidean distance, looping over the
>   vectors seems to be faster than the vector based operation .... the most
>   frequent used function in the kMean algoritm is the euclidean distance
>   measure, so we gain significant speed ... (at least on my machine)
>  (any thoughts about implementing the bi-secting kmeans algoritm ?)

I'm not familiar with the bisecting algorithm, but it sounds cool!  ;)


># this is slow
>def EucDist2(v1, v2):
>     return sqrt(sum((v1-v2)**2))
>
># this is faster
>def EucDist1(v1, v2):
>     sum = 0
>     for i in range(0,len(v1)):
>         sum += (v1[i] -v2[i])**2
>     return sqrt(sum)


Wow, that's surprising.  In EucDist2, the (v1-v2) work is getting
pushed down into Numeric, which loops through the array in C code.
However, there is some overhead because the results of v1-v2 gets
instantiated as a new list, and then squared, and then summed.

EucDist1 has less overhead because the calculations are done on the
fly, although in Python code.  Indexing, i.e. v1[i], does generate a
new object, but because of the semantics of Numeric, doesn't copy the
values from the original array.  However, I would have expected the
overhead from looping through the python code to offset that.

On my system, EucDist2 is about 10x faster than EucDist1.

[krusty:~] jchang% python test.py
EucDist1 time: 14.2610449791
EucDist1 31622.7766017
EucDist2 time: 1.84816789627
EucDist2 31622.7766017
[krusty:~] jchang%

This is running on Darwin-ppc, Python 2.1, Numeric 20.0.0.  I get
similar results on solaris-sparc, Python 2.1, Numeric 17.3.

So, the behavior you're seeing must be system dependent, which means
we should provide both implementations of euclidean distance and let
people choose which one to use.

Jeff



[krusty:~] jchang% cat test.py
import time
from Numeric import *

# this is slow
def EucDist2(v1, v2):
     return sqrt(sum((v1-v2)**2))

# this is faster
def EucDist1(v1, v2):
     sum = 0
     for i in range(0,len(v1)):
         sum += (v1[i] -v2[i])**2
     return sqrt(sum)

v1 = map(float, range(1000))
v2 = map(float, range(1000, 2000))
av1 = array(v1, Float32)
av2 = array(v2, Float32)

NTIMES = 1000

start = time.time()
for i in range(NTIMES):
     EucDist1(av1, av2)
t = time.time() - start
print "EucDist1 time:", t
print "EucDist1", EucDist1(av1, av2)

start = time.time()
for i in range(NTIMES):
     EucDist2(av1, av2)
t = time.time() - start
print "EucDist2 time:", t
print "EucDist2", EucDist2(av1, av2)




More information about the Biopython-dev mailing list