Michiel Jan Laurens de Hoon
mdehoon at ims.u-tokyo.ac.jp
Sat Jun 14 01:11:20 EDT 2003
> Is there a way to do a k-means clustering (or other) based on a
> distance matrix, rather than on the gene expression vector data?
For k-means clustering, this is in general not possible, as you need to
recalculate the cluster centroids in order to get the distances between
clusters. Ditto for hierarchical clustering using pairwise
centroid-linkage. However, it is possible if the Euclidean distance is
used as a measure of similarity (instead of e.g. the Pearson
correlation), but I haven't implemented that.
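To illustrate why the Euclidean case is the exception: the squared distance from an item to a cluster centroid can be recovered from pairwise distances alone, so the centroid never has to be computed explicitly. The following is only a sketch of that identity, not code from Bio.Cluster; the function name and the toy data are made up for illustration.

```python
import math

def squared_distance_to_centroid(d, i, members):
    """Squared Euclidean distance from item i to the centroid of `members`.

    Uses only the pairwise distance matrix d, where d[a][b] is the Euclidean
    distance between items a and b. Hypothetical helper, for illustration only.
    """
    n = len(members)
    # Mean squared distance from item i to every member of the cluster.
    to_members = sum(d[i][j] ** 2 for j in members) / n
    # Mean squared distance over all ordered member pairs, halved;
    # this equals the mean squared distance of the members to their centroid.
    within = sum(d[j][k] ** 2 for j in members for k in members) / (2 * n * n)
    return to_members - within

# Toy data: the coordinates are used only to build the distance matrix.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
d = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points] for p in points]

# Distance from item 3 to the centroid of the cluster {0, 1, 2}.
result = squared_distance_to_centroid(d, 3, [0, 1, 2])
```

A k-means iteration could then assign each item to the cluster with the smallest such value. The identity relies on Euclidean geometry, so it does not carry over to a Pearson-correlation-based distance.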
For hierarchical clustering using pairwise single-, maximum-, or
average-linkage, the distance matrix is sufficient no matter which
distance measure is used. The hierarchical clustering routine in the
underlying C library actually allows you to pass in the distance matrix
without the original gene expression data.
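As a rough sketch of why the distance matrix suffices for these three linkage methods: the distance between two clusters is just the minimum, maximum, or mean of the item-to-item distances between them, so the original data are never consulted. The code below is a deliberately naive pure-Python illustration with a made-up function name; it is not the routine in the C library and makes no attempt at efficiency.

```python
def pairwise_linkage(d, method="single"):
    """Naive agglomerative clustering on a full symmetric distance matrix d.

    Returns a list of merges as (members_a, members_b, distance) tuples.
    Hypothetical helper for illustration; not part of Bio.Cluster.
    """
    combine = {
        "single": min,                         # closest pair of items
        "maximum": max,                        # farthest pair of items
        "average": lambda v: sum(v) / len(v),  # mean over all item pairs
    }[method]
    clusters = [[i] for i in range(len(d))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = combine([d[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        dist, a, b = best
        merges.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Toy distance matrix for four items located at 0, 1, 5 and 7 on a line.
positions = [0, 1, 5, 7]
d = [[abs(p - q) for q in positions] for p in positions]
single = pairwise_linkage(d, "single")
```

Only the entries of `d` are read, which is the property that makes these linkage methods independent of the distance measure used to build the matrix.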
The reason that I haven't made that available in the Python interface is
the fact that these matrices get quite large (e.g. for the Bacillus
subtilis genome, the > 4000 genes would lead to a matrix with > 16000000
elements). This matrix is symmetric, so actually we need to store only
half of that, which can be done easily in C using a ragged array but not
so easily in Python.
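For reference, the ragged layout can at least be mimicked in Python with a list of lists in which row i holds only the i distances to items 0..i-1. This is a sketch of the layout only, with hypothetical helper names; it says nothing about how the Python/C interface would accept such a structure, and Python's per-object overhead means it is not as compact as the C version.

```python
def make_ragged(n, fill=0.0):
    """Lower triangle without the diagonal: row i has i entries."""
    return [[fill] * i for i in range(n)]

def set_distance(ragged, i, j, value):
    """Store the distance between items i and j (i != j), in either order."""
    if i == j:
        raise ValueError("the diagonal is implicitly zero and is not stored")
    if i < j:
        i, j = j, i
    ragged[i][j] = value

def get_distance(ragged, i, j):
    """Look up the distance between items i and j, in either order."""
    if i == j:
        return 0.0
    if i < j:
        i, j = j, i
    return ragged[i][j]

ragged = make_ragged(5)
set_distance(ragged, 1, 3, 2.5)
stored = sum(len(row) for row in ragged)  # n * (n - 1) / 2 entries
```

For n items this stores n(n-1)/2 values, so 4000 genes would need 7,998,000 entries rather than 16,000,000.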
I assume that your protein data are smaller than that, or maybe you
don't care so much about the memory requirements. How do you store the
protein similarity data in Python? If it doesn't matter that the matrix
is stored inefficiently in Python, I can modify the Python/C interface
to let you pass in the distance matrix directly to the pairwise
Iddo Friedberg wrote:
> Dear Michiel,
> I just looked at the manual for Bio.Cluster (very well written, BTW). Is
> there a way to do a k-means clustering (or other) based on a distance
> matrix, rather than on the gene expression vector data? The data I am
> trying to cluster are the structural similarities of protein structure
> fragments, and as such they already appear in matrix form.
> Michiel Jan Laurens de Hoon wrote:
>> Dear biopython developers,
>> I have added Bio.Cluster to the Biopython CVS. Bio.Cluster contains
>> clustering techniques for gene expression data (hierarchical, k-means,
>> and SOMs); most routines are written in C with a Python wrapper. This
>> package also exists separately as Pycluster.
>> The Python and C source code is in Bio/Cluster; I have also added
>> Bio.Cluster to setup.py.
>> In case you want to try this package, there is a manual at
>> (replace "from Pycluster import *" by "from Bio.Cluster import *") and
>> a sample data set at
>> Please let me know if you find any problems with this package.
Michiel de Hoon, Assistant Professor
University of Tokyo, Institute of Medical Science
Human Genome Center
4-6-1 Shirokane-dai, Minato-ku