[Biopython-dev] Bio.Cluster

Sat Jun 14 01:11:20 EDT 2003

 > Is there a way to do a k-means clustering (or other) based on a
 > distance matrix, rather than on the gene expression vector data?

For k-means clustering, this is in general not possible, as you need to 
recalculate the cluster centroids in order to get the distances to the 
cluster centroids. Ditto for hierarchical clustering using pairwise 
centroid-linkage. However, it is possible if the Euclidean distance is 
used as a measure of similarity (instead of e.g. the Pearson 
correlation), because the distance to a centroid can then be derived 
from the pairwise distances alone, but I haven't implemented that.
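
To make the Euclidean case concrete, here is a minimal sketch. It is not 
part of Bio.Cluster or the C library, and the function name is made up 
for illustration. It relies on the standard identity that the squared 
Euclidean distance from a point to a cluster centroid equals the mean 
squared distance to the cluster members minus half the mean squared 
distance among the members, so the centroid itself is never computed:

```python
def squared_distance_to_centroid(dist, point, members):
    """Squared Euclidean distance from `point` to the centroid of `members`.

    `dist` is a full symmetric matrix (list of lists) of plain Euclidean
    distances; only pairwise distances are used, never coordinates.
    """
    n = len(members)
    # Mean squared distance from the point to every cluster member.
    to_members = sum(dist[point][j] ** 2 for j in members) / n
    # Mean squared distance among the members; every pair is counted
    # twice in the double sum, hence the factor 2 in the denominator.
    within = sum(dist[j][k] ** 2 for j in members for k in members) / (2 * n * n)
    return to_members - within


# Three points on a line at 0, 2 and 10 (hypothetical example data).
dist = [[0, 2, 10],
        [2, 0, 8],
        [10, 8, 0]]
result = squared_distance_to_centroid(dist, 2, [0, 1])
```

For this example the centroid of items 0 and 1 lies at 1, so the point 
at 10 is at squared distance 81, which is what the function returns. 
This only holds for Euclidean distances; for the Pearson correlation 
there is no such shortcut.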

For hierarchical clustering using pairwise single-, maximum- (i.e. 
complete-), or average-linkage, the distance matrix is sufficient no 
matter which distance measure is used. The hierarchical clustering 
routine in the 
underlying C library actually allows you to pass in the distance matrix 
without the original gene expression data.
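
The naive sketch below illustrates why the matrix alone is enough for 
these three linkage rules: the distance between two clusters is just the 
minimum, maximum, or mean of the item-to-item distances. This is only an 
illustration, not the routine in the C library; the function name is 
hypothetical, and it recomputes all cluster distances at every step, so 
it is far too slow for real data sets.

```python
def linkage_from_distances(dist, method="single"):
    """Naive agglomerative clustering driven only by a distance matrix.

    `dist` is a full symmetric matrix (list of lists). Returns the merges
    in order as (cluster_a, cluster_b, distance) tuples, where each
    cluster is a tuple of item indices.
    """
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[method]
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # The inter-cluster distance needs item-to-item
                # distances only, never the original data vectors.
                values = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(values)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical example: three items with pairwise distances 2, 8 and 10.
dist = [[0, 2, 10],
        [2, 0, 8],
        [10, 8, 0]]
single = linkage_from_distances(dist, "single")
complete = linkage_from_distances(dist, "complete")
average = linkage_from_distances(dist, "average")
```

In this example all three rules first join items 0 and 1 at distance 2; 
they differ only in the distance at which item 2 joins (8, 10, and 9 
respectively).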

The reason that I haven't made that available in the Python interface is 
that these matrices get quite large (e.g. for the Bacillus subtilis 
genome, the more than 4000 genes would lead to a matrix with more than 
16,000,000 elements). This matrix is symmetric, so actually we need to 
store only 
half of that, which can be done easily in C using a ragged array but not 
so easily in Python.
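
For comparison, the ragged layout can be imitated in pure Python with a 
list of lists, as in the sketch below. The helper names are hypothetical 
and none of this is in Bio.Cluster. Row i holds only the i entries below 
the diagonal, so n items need n(n-1)/2 entries; for n = 4000 that is 
7,998,000 instead of 16,000,000. Each stored value is still a Python 
object referenced from the list, so the per-element cost remains higher 
than that of a C double.

```python
def make_ragged(n, fill=0.0):
    """Row i holds the i distances to items 0..i-1; the diagonal is implicit."""
    return [[fill] * i for i in range(n)]


def get_distance(ragged, i, j):
    """Look up the distance between items i and j in either order."""
    if i == j:
        return 0.0
    if i < j:
        i, j = j, i  # only the lower triangle is stored
    return ragged[i][j]


def set_distance(ragged, i, j, value):
    """Store the distance between two distinct items i and j."""
    if i == j:
        raise ValueError("the diagonal is fixed at zero")
    if i < j:
        i, j = j, i
    ragged[i][j] = value


# Small hypothetical example with 100 items.
matrix = make_ragged(100)
set_distance(matrix, 3, 7, 1.5)
stored = sum(len(row) for row in matrix)
```

Here `stored` is 4950, i.e. 100 * 99 / 2, and the value set for the pair 
(3, 7) can be read back as either (3, 7) or (7, 3).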

I assume that your protein data are smaller than that, or maybe you 
don't care so much about the memory requirements. How do you store the 
protein similarity data in Python? If it doesn't matter that the matrix 
is stored inefficiently in Python, I can modify the Python/C interface 
to let you pass in the distance matrix directly to the pairwise 
single/complete/average routines.

--Michiel.

Iddo Friedberg wrote:

> Dear Michiel,
> 
> I just looked at the manual for Bio.Cluster (very well written, BTW). Is 
> there a way to do a k-means clustering (or other) based on a distance 
> matrix, rather than on the gene expression vector data? The data I am 
> trying to cluster is the structural similarity of protein structure 
> fragments, and as such already appears in matrix form.
> 
> Thanks,
> 
> ./I
> 
> 
> 
> Michiel Jan Laurens de Hoon wrote:
> 
>> Dear biopython developers,
>>
>> I have added Bio.Cluster to the Biopython CVS. Bio.Cluster contains 
>> clustering techniques for gene expression data (hierarchical, k-means, 
>> and SOMs); most routines are written in C with a Python wrapper. This 
>> package also exists separately as Pycluster.
>>
>> The Python and C source code is in Bio/Cluster; I have also added 
>> Bio.Cluster to setup.py.
>>
>> In case you want to try this package, there is a manual at
>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/cluster.pdf
>> (replace "from Pycluster import *" by "from Bio.Cluster import *") and 
>> a sample data set at
>> http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/demo.txt.
>> Please let me know if you find any problems with this package.
>>
>> --Michiel.
>>
> 

-- 
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku
Tokyo 108-8639
Japan
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon