[Bioperl-l] discussion of various Tree methods

Aaron J Mackey Aaron J. Mackey" <amackey@virginia.edu
Wed, 17 Jul 2002 13:45:34 -0400 (EDT)


If there was a previous discussion regarding these, please forgive my
ignorance of them (and maybe point me to the message ids) ...

There is a bit of confusion in the Tree and Node objects, especially
concerning "height": A Tree object thinks of height as the number of
"levels" of branching from the root to the nodes (for a balanced binary
tree, that's log2(number of nodes), which is how it's currently
implemented).  Nevermind the balanced binary tree issue for now.  A Node
object, on the other hand, thinks of height as the largest sum of
branchlengths from it to a terminal leaf.

I know that branch length is a ubiquitous phylogenetic term, and I'm not
proposing we turn our Tree objects into generic Computer Science tree
objects, but I think it would help if we found some slightly different
names for these two concepts: the first is a purely topological concept,
and has nothing to do with any values assigned to any nodes/branches.  I
propose to call that "depth".  The "depth" of a tree then is the longest
"depth" of a terminal leaf.  The "depth" of any internal or external node
(i.e. leaf) is the topological distance between it and the root (and for
balanced binary trees, the depth of all leaves happens to be log2(n)).

With the advent of "depth", I'm happy to leave "height" alone (although
I'm not sure I'd always want to use branch_length as the metric; sometimes
bootstrap, sometimes a mutational rate instead of time, etc; I guess I
could supplant the preexisting branch_length values with the ones I wanted
to use, but that requires me to be messy; in any case, it needs to be
updated to reflect that we can handle trees that aren't balanced binary
trees).

Once we have "depth", we can now do topological distance easily between
any two nodes via the "lca" operator:

sub distance {
    my ($t, $a, $b) = @_;
    return $a->depth + $b->depth - $t->lca($a, $b)->depth;
}

(Note that you could also do a "valued" distance by replacing the calls to
depth with calls to height, modulo the same arguments as above)

sub lca {
    my ($t, $a, $b) = @_;

    # this could be made faster with Tie::IxHash
    # or similar ordered hash logic

    my @a = ($a);
    push @a, $a while ($a = $a->ancestor);
    my @b = ($b);
    push @b, $b while ($b = $b->ancestor);
    while (@a && @b) {
        last if $a[0]->internal_id ne $b[0]->internal_id;
        ($a, $b) = (pop(@a), pop(@b));
    }
    return $a; # or $b, doesn't matter
}

Suggestions for improvement of lca() are welcome (I'd rather not calculate
the entire transitive closure for this, but a tc() method would be useful
on other occasions).

On another note, here are some of the tree topology descriptive statistics
and metrics I'd like to see:

stats on the whole tree:
------------------------
size (# of nodes)
resolution
balance
cic (cladistic information content)
bandwidth (of the matrix implied by the tree)

node metrics: (only calculated when asked for, then cached until topology changes)
-------------
depth (see above)
height (in any of its incarnations, see above)
Horton/Strahler index (degree of sub branching)
flow (contribution of a node to the overall tree)
bifurcation # (number of children)

And finally, some of the other tree methods I'd like to see:

@nodes = $tree->polytomies(); # returns the roots of any polytomies found
$tree->is_rooted(); # well is it, or isn't it? Never mind the implicit root.
$tree->reroot($node); # not just set_root_node, but rearrange topology
                      # to not lose old root (if present); branches from
                      # new root to old root need to "invert" direction
$tree->autoroot(); # like the "retree" program's "root" option.
$tree->collapse($nodeA => $nodeB); # delete nodeA and give all of it's
                                   # children to nodeB

Thanks for listening.

-Aaron

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu