[Biopython-dev] Bio.Cluster.Tree -> Bio.Phylo

Fri Apr 20 22:28:43 UTC 2012

Eric,

If my understanding is correct, UPGMA is another name for agglomerative
average-linkage hierarchical clustering, which Bio.Cluster implements
alongside single- and complete-linkage clustering. There's no equivalent
of neighbor-joining or maximum likelihood, and Bio.Cluster probably isn't
that fast with large numbers of nodes, so wrappers for external programs
are still useful. We could probably add an NJ implementation for small
matrices pretty easily if you think it's worthwhile.
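
To give a rough idea of how little code NJ needs for small matrices,
here is a minimal pure-Python sketch. The function name
neighbor_joining and the choice to return a Newick string are my own
assumptions for illustration, not anything that exists in Biopython,
and the O(n^3) loop is only sensible for small inputs.

```python
from typing import Sequence


def neighbor_joining(names: Sequence[str],
                     matrix: Sequence[Sequence[float]]) -> str:
    """Return an unrooted neighbor-joining tree as a Newick string.

    Hypothetical helper for illustration; `matrix` is assumed to be a
    full, symmetric distance matrix with zeros on the diagonal.
    """
    if len(names) < 2 or len(names) != len(matrix):
        raise ValueError("need at least two taxa and a square matrix")
    labels = list(names)
    dist = [[float(x) for x in row] for row in matrix]

    while len(labels) > 2:
        n = len(labels)
        totals = [sum(row) for row in dist]
        # Pick the pair minimizing Q(i, j) = (n - 2) d_ij - r_i - r_j.
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                q = (n - 2) * dist[i][j] - totals[i] - totals[j]
                if best is None or q < best[0]:
                    best = (q, i, j)
        _, i, j = best
        d_ij = dist[i][j]
        # Branch lengths from the new internal node to i and j.
        len_i = d_ij / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        # Distances from the new node to every remaining node.
        keep = [k for k in range(n) if k not in (i, j)]
        new_row = [(dist[i][k] + dist[j][k] - d_ij) / 2 for k in keep]
        dist = [[dist[a][b] for b in keep] + [new_row[x]]
                for x, a in enumerate(keep)]
        dist.append(new_row + [0.0])
        joined = "(%s:%g,%s:%g)" % (labels[i], len_i, labels[j], len_j)
        labels = [labels[k] for k in keep] + [joined]

    # Two nodes remain; put the whole remaining distance on one branch.
    return "(%s,%s:%g);" % (labels[0], labels[1], dist[0][1])


if __name__ == "__main__":
    taxa = ["A", "B", "C", "D"]
    distances = [[0, 3, 5, 6],
                 [3, 0, 6, 7],
                 [5, 6, 0, 7],
                 [6, 7, 7, 0]]
    print(neighbor_joining(taxa, distances))
```

Ties in Q are broken by taking the first pair found, so equally good
joins may come out in a different order than other implementations.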

Either way, glue between Bio.Cluster trees and Bio.Phylo could be useful
for visualizing relationships between genes/samples in microarrays
(which, as I gather, is what Bio.Cluster is intended for).
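
As a sketch of what that glue might look like, the snippet below turns
a list of clustering nodes into a Newick string. It does not import
Bio.Cluster; the (left, right, distance) tuples are stand-ins, and I'm
assuming the convention that non-negative indices are original items
and index -(i + 1) refers to the cluster made by the i-th node. Setting
each node's height to half its distance is also my assumption about how
to get UPGMA-style branch lengths, and the function name is made up.

```python
from typing import List, Sequence, Tuple

# Stand-in for a clustering node: (left, right, distance).
Node = Tuple[int, int, float]


def cluster_nodes_to_newick(nodes: Sequence[Node],
                            names: Sequence[str]) -> str:
    """Convert hierarchical-clustering nodes to a Newick string.

    Hypothetical helper. Assumes nodes are listed in merge order, so a
    negative index always refers to a node that was created earlier.
    """
    if not nodes:
        raise ValueError("need at least one node")
    subtrees: List[Tuple[str, float]] = []  # (newick text, height)

    def resolve(index: int) -> Tuple[str, float]:
        if index >= 0:
            return names[index], 0.0      # leaf, height zero
        return subtrees[-index - 1]       # earlier cluster

    for left, right, distance in nodes:
        height = distance / 2.0           # assumed ultrametric height
        parts = []
        for child in (left, right):
            text, child_height = resolve(child)
            parts.append("%s:%g" % (text, height - child_height))
        subtrees.append(("(%s)" % ",".join(parts), height))

    return subtrees[-1][0] + ";"


if __name__ == "__main__":
    example = [(0, 1, 2.0), (-1, 2, 4.0)]
    print(cluster_nodes_to_newick(example, ["A", "B", "C"]))
```

The resulting string could then be handed to Bio.Phylo's Newick parser,
e.g. Bio.Phylo.read(StringIO(newick), "newick"), for drawing.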

Andrew

On 04/17/2012 11:25 AM, Eric Talevich wrote:
> Andrew,
>
> It would be useful to have a quick and portable function for
> distance-based tree estimation in Bio.Phylo, since otherwise it's
> necessary to use one of the wrappers for external programs in
> Bio.Phylo.Applications. (And currently, only PhyML is wrapped.) Does
> the hierarchical clustering algorithm in Bio.Cluster correspond to any
> common tree-estimation algorithm, e.g. UPGMA? If so, then it would
> make a lot of sense to provide the glue for using it that way. If you
> have done some work in this direction, I would be happy to see it.
>
> -Eric
>
>
> On Mon, Apr 16, 2012 at 6:47 PM, Andrew Sczesnak
> <andrew.sczesnak at med.nyu.edu>  wrote:
>> Eric,
>>
>> I can describe two use cases from my own experience. First, the MAF parser
>> I've been working on can pull the multiple alignment of some gene between a
>> bunch of genomes. Thinking of recipes for the cookbook, I thought it would
>> be neat to walk the user through constructing a distance matrix by hand
>> (though you're right--more could be done to support this), clustering with
>> Bio.Cluster and visualizing the result with Bio.Phylo. I like this example
>> because it integrates several different parts of Biopython along with a
>> lesson about inferring distances between sequences.
>>
>> Second, for another project, I've been generating distance matrices based on
>> the shared gene content of bacterial genomes and the presence-or-absence of
>> orthologous groups in each. Presently, I ferry the matrices to a clustering
>> program and then visualize the resulting trees in yet another tool. Looking
>> into ways of streamlining this brought me back to Bio.Cluster, Bio.Phylo and
>> the incompatibility of their tree objects.
>>
>> I wonder, what would be the most elegant way of bridging the gap?
>>
>>
>> Best,
>> Andrew
>>