[BioPython] Spatial clustering
    Shu-Hsien Sheu 
    ssheu at post.harvard.edu
       
    Tue Oct 14 11:16:00 EDT 2003
    
    
  
Dear all,
Thanks for all the input!
I am new to this field and come from a biology background, so I am not 
very familiar with computer science. The project, however, has been 
running for a year and has shown great results for some enzymes we tested:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=14499612&dopt=Abstract
The basic idea is to use organic solvents as "probes" and use an energy 
function to find the favorable minima. We first use a simplex method 
with van der Waals cancellation and then do further minimization 
using CHARMm. Through some testing we found that 6660 positions of the 
probes give the best results. By clustering those molecules and 
calculating the average free energy of each cluster, we can come up with 
the top 5 energetically favorable clusters. It was shown that the 
"consensus" site of the clusters of different probes is the binding site 
of the protein.
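The ranking step described above could be sketched roughly like this in Python. This is only an illustration, not the existing C code: the function name top_clusters, the dict-of-lists layout for the clusters, and the energy values are all made up for the example.

```python
from statistics import mean


def top_clusters(clusters, energies, n_top=5):
    """Return the n_top clusters with the lowest average free energy.

    clusters: dict mapping a cluster label to a list of probe indices
              (hypothetical layout, chosen for this sketch)
    energies: list of per-probe free energies, indexed by probe index
    """
    # Pair each cluster with its mean energy; sorting puts the lowest first.
    ranked = sorted(
        (mean(energies[i] for i in members), label)
        for label, members in clusters.items()
    )
    return [(label, avg) for avg, label in ranked[:n_top]]


# Made-up numbers, only to show the call.
energies = [-5.0, -4.0, -1.0, -3.0, -2.0, -6.0]
clusters = {"A": [0, 1], "B": [2], "C": [3, 4], "D": [5]}
best = top_clusters(clusters, energies, n_top=2)
print(best)
```

Lower (more negative) energy is treated as more favorable here, which is an assumption about the sign convention.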
Actually, the clustering code already exists and is written in C. The 
person who wrote both the mapping program and the clustering program has 
already left this lab. Originally I was working on the consensus-site 
finding part, which in the past was done by manual inspection in RasMol 
or PyMOL, but I later thought it might be more efficient to wrap 
these two parts together. To me, creating a valid RMSD matrix seems to be 
as important as the clustering algorithm. For instance, the small 
molecules we use range from methanol to t-butanol, and for the latter, 
two reference points might be needed. Finding the consensus site might 
be more problematic, since you are then dealing with different kinds of 
molecules. Any comments here?
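To make the RMSD question concrete, here is a minimal Python sketch of a pairwise RMSD between two poses of the same probe. It assumes both poses list their atoms in the same order and are already in the same (protein) coordinate frame, so no superposition is done. It does not treat symmetry-equivalent atoms specially; that would need extra handling. The name pose_rmsd and the coordinates are made up.

```python
import math


def pose_rmsd(coords_a, coords_b):
    """RMSD between two poses given as lists of (x, y, z) tuples.

    Assumes identical atom ordering and a shared coordinate frame;
    no fitting or superposition is performed.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Two made-up two-atom poses, shifted by 1 along z.
pose1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(pose_rmsd(pose1, pose2))
```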
Clustering seems to be an important issue in molecular 
modelling. People working on protein-protein docking in this lab have 
all put some effort into it, though no collaboration or uniform method 
has been developed yet.
I have a naive question about arrays/matrices. Pairwise RMSD has no 
direction, i.e. RMSD(1,2) == RMSD(2,1).
Therefore, the distance matrix would look like this:
       1     2     3     4     5
1      X   0.2   0.1   1.2   3.4
2    0.2     X   0.5   0.2   0.4
3    0.1   0.5     X   0.6   0.7
4    1.2   0.2   0.6     X   0.2
5    3.4   0.4   0.7   0.2     X
I've read the Numarray tutorial and there seem to be no special functions 
for matrices that are symmetric about the diagonal. Are there any more 
efficient approaches?
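One possible approach, sketched in plain Python rather than Numarray, is to store only the entries with i < j in a flat list and map (i, j) to a position in it. The class name SymmetricDistances is made up for this example, and indices are 0-based, so molecule 1 in the table is index 0 here.

```python
class SymmetricDistances:
    """Store only the i < j entries of an n x n symmetric distance matrix."""

    def __init__(self, n):
        self.n = n
        # n*(n-1)/2 unique pairs; the diagonal is implicitly 0.
        self.values = [0.0] * (n * (n - 1) // 2)

    def _index(self, i, j):
        if i > j:
            i, j = j, i  # (i, j) and (j, i) share one slot
        # Row i starts after rows 0..i-1; row k holds n-1-k entries.
        return i * self.n - i * (i + 1) // 2 + (j - i - 1)

    def get(self, i, j):
        return 0.0 if i == j else self.values[self._index(i, j)]

    def set(self, i, j, value):
        if i == j:
            raise IndexError("the diagonal is fixed at 0")
        self.values[self._index(i, j)] = value


# Fill in the 5 x 5 example, computing each pair only once.
d = SymmetricDistances(5)
pairs = {(0, 1): 0.2, (0, 2): 0.1, (0, 3): 1.2, (0, 4): 3.4, (1, 2): 0.5,
         (1, 3): 0.2, (1, 4): 0.4, (2, 3): 0.6, (2, 4): 0.7, (3, 4): 0.2}
for (i, j), rmsd in pairs.items():
    d.set(i, j, rmsd)
print(d.get(4, 0), d.get(0, 4))
```

This roughly halves both the storage and the number of RMSD calculations, at the cost of an index calculation on every lookup.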
An algorithm I have in mind is: starting with the RMSD matrix, first 
find the probe with the most neighbors, make it the hub of a cluster, and 
take it out along with its members; then do the same thing recursively 
on what is left.
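A rough Python sketch of that idea, written as a loop instead of recursion. Two things here are assumptions, not part of the description above: a "neighbor" is taken to mean any probe within a fixed RMSD cutoff, and ties between equally good hubs go to the lowest index. The function name and the cutoff value are made up.

```python
def greedy_hub_clusters(dist, cutoff):
    """Repeatedly pick the item with the most neighbors as a cluster hub.

    dist:   full square matrix (list of lists) of pairwise RMSD values
    cutoff: hypothetical threshold; j is a neighbor of i if dist[i][j] <= cutoff
    Returns a list of (hub, members) tuples, with members including the hub.
    """
    unassigned = set(range(len(dist)))
    clusters = []

    def neighbors(i):
        # Only items not yet placed in a cluster count as neighbors.
        return [j for j in unassigned if j != i and dist[i][j] <= cutoff]

    while unassigned:
        # max() keeps the first best item, so ties go to the lowest index.
        hub = max(sorted(unassigned), key=lambda i: len(neighbors(i)))
        members = [hub] + sorted(neighbors(hub))
        clusters.append((hub, members))
        unassigned.difference_update(members)
    return clusters


# The 5 x 5 example matrix, 0-based, with 0.0 on the diagonal.
dist = [
    [0.0, 0.2, 0.1, 1.2, 3.4],
    [0.2, 0.0, 0.5, 0.2, 0.4],
    [0.1, 0.5, 0.0, 0.6, 0.7],
    [1.2, 0.2, 0.6, 0.0, 0.2],
    [3.4, 0.4, 0.7, 0.2, 0.0],
]
clusters = greedy_hub_clusters(dist, cutoff=0.3)
print(clusters)
```

Note that members of one cluster are only guaranteed to be close to the hub, not to each other, so the result depends quite a bit on the cutoff.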
Dear Iddo,
I just had a look at CLUTO and will try to find out whether it suits my 
purpose. Thanks!
Dear Andrew,
I am not familiar with fingerprints or shape fitting. Can you point me 
to a place to start? I will search Google as well. I am also not familiar 
with pharmacophores and will look into them too.
Dear Michiel,
I've read the PyCluster documentation, and it seems I had missed the 
point that treecluster lets me specify the distance matrix 
myself. That might be the easiest solution. Thanks!
-shuhsien
    
    