[Biopython-dev] Bio.Cluster

Fri Jul 18 12:23:08 EDT 2003

Thanks! That looks extremely useful.

One comment (just from reading your email, I haven't looked at this 
yet): if the data in the matrix is very sparse, then a NumPy array would 
seem redundant in the sense that most of it will be zeros, and the user 
will be trying to pass a huge data structure, in which most of the data 
is superfluous. Or am I getting something wrong here?
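
As a rough illustration of the redundancy in question, the sketch below uses plain NumPy only (it does not call Bio.Cluster). It checks the two properties the quoted message says the code does not verify (symmetry and a zero diagonal), then pulls out the entries below the diagonal, which are the only independent values in such a matrix. The helper names check_distance_matrix and lower_triangle and the tolerance tol are hypothetical, chosen for this example.

```python
import numpy as np


def check_distance_matrix(d, tol=1e-12):
    """Return True if d is square, symmetric, and has zeros on the diagonal.

    Hypothetical helper for illustration; not part of Bio.Cluster.
    """
    d = np.asarray(d, dtype=float)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        return False
    symmetric = np.allclose(d, d.T, atol=tol)
    zero_diagonal = np.allclose(np.diag(d), 0.0, atol=tol)
    return bool(symmetric and zero_diagonal)


def lower_triangle(d):
    """Return the n*(n-1)/2 entries strictly below the diagonal, row by row."""
    d = np.asarray(d, dtype=float)
    rows, cols = np.tril_indices(d.shape[0], k=-1)
    return d[rows, cols]


# Small made-up distance matrix for three items.
my_distance_matrix = np.array([[0.0, 1.0, 4.0],
                               [1.0, 0.0, 2.0],
                               [4.0, 2.0, 0.0]])

is_valid = check_distance_matrix(my_distance_matrix)
independent_values = lower_triangle(my_distance_matrix)
```

For n items the full array holds n*n values, of which only n*(n-1)/2 are independent, so roughly half of a dense symmetric matrix is repeated information even before any sparsity is taken into account.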

Thanks!

Iddo

Michiel Jan Laurens de Hoon wrote:
> I have added an option to do hierarchical clustering based on the 
> distance matrix directly. The new version is in Biopython's CVS. To 
> apply hierarchical clustering to the gene expression data, use
> 
> treecluster(my_matrix, ...)
> 
> or
> 
> treecluster(data=my_matrix, ...)
> 
> To do hierarchical clustering on the distance matrix directly, use
> 
> treecluster(distancematrix=my_distance_matrix, ...)
> 
> where my_distance_matrix is a 2D Numpy array which is symmetric and has 
> zeros on the diagonal (though the code does not check for it). This 
> works for pairwise single-, maximum-, and average-linkage, but not for 
> pairwise centroid-linkage, for which you would need the original gene 
> expression data.
> 
> I had to make some modifications in the Python <-> C interface for this, 
> which tends to be error prone. If you find any bugs, please let me know.
> 
> --Michiel.
> 
> Iddo Friedberg wrote:
> 
>> Dear Michiel,
>>
>> I just looked at the manual for Bio.Cluster (very well written, BTW). 
>> Is there a way to do a k-means clustering (or other) based on a 
>> distance matrix, rather than on the gene expression vector data? The 
>> data I am trying to cluster is the structural similarity of protein 
>> structure fragments, and as such it already appears in matrix form.
>>
>> Thanks,
>>
>> ./I
>>
>>
>>
>> Michiel Jan Laurens de Hoon wrote:
>>
>>> Dear biopython developers,
>>>
>>> I have added Bio.Cluster to the Biopython CVS. Bio.Cluster contains 
>>> clustering techniques for gene expression data (hierarchical, 
>>> k-means, and SOMs); most routines are written in C with a Python 
>>> wrapper. This package also exists separately as Pycluster.
>>>
>>> The Python and C source code is in Bio/Cluster; I have also added 
>>> Bio.Cluster to setup.py.
>>>
>>> In case you want to try this package, there is a manual at
>>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf
>>> (replace "from Pycluster import *" by "from Bio.Cluster import *") 
>>> and a sample data set at
>>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/demo.txt.
>>> Please let me know if you find any problems with this package.
>>>
>>> --Michiel.
>>>
>>
> 

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 646 3171
http://ffas.ljcrf.edu/~iddo