[Bioperl-l] discussion of various Tree methods

Jason Stajich jason@cgt.mc.duke.edu
Wed, 17 Jul 2002 14:56:39 -0400 (EDT)


On Wed, 17 Jul 2002, Aaron J Mackey wrote:

>
> If there was a previous discussion regarding these, please forgive my
> ignorance of them (and maybe point me to the message ids) ...
>

Realizing there has been no discussion this - I implementing the tree
objects as I needed for coalescent simulation (see RandomTreeFactory).
Very glad to have discussion.

> There is a bit of confusion in the Tree and Node objects, especially
> concerning "height": A Tree object thinks of height as the number of
> "levels" of branching from the root to the nodes (for a balanced binary
> tree, that's log2(number of nodes), which is how it's currently
> implemented).  Nevermind the balanced binary tree issue for now.  A Node
> object, on the other hand, thinks of height as the largest sum of
> branchlengths from it to a terminal leaf.
>
> I know that branch length is a ubiquitous phylogenetic term, and I'm not
> proposing we turn our Tree objects into generic Computer Science tree
> objects, but I think it would help if we found some slightly different
> names for these two concepts: the first is a purely topological concept,
> and has nothing to do with any values assigned to any nodes/branches.  I
> propose to call that "depth".  The "depth" of a tree then is the longest
> "depth" of a terminal leaf.  The "depth" of any internal or external node
> (i.e. leaf) is the topological distance between it and the root (and for
> balanced binary trees, the depth of all leaves happens to be log2(n)).
>
> With the advent of "depth", I'm happy to leave "height" alone (although
> I'm not sure I'd always want to use branch_length as the metric; sometimes
> bootstrap, sometimes a mutational rate instead of time, etc; I guess I
> could supplant the preexisting branch_length values with the ones I wanted
> to use, but that requires me to be messy; in any case, it needs to be
> updated to reflect that we can handle trees that aren't balanced binary
> trees).

I needed something to get the nodes in order so I had used height, but
you're right this is very overloaded name.  depth seemse like a fine
compromise and I'd be happy for height to be changed to a more explicit
name and/or made dynamic so that different fields can be used to calculate
the node's 'height' from the leaves.

>
> Once we have "depth", we can now do topological distance easily between
> any two nodes via the "lca" operator:
>
> sub distance {
>     my ($t, $a, $b) = @_;
>     return $a->depth + $b->depth - $t->lca($a, $b)->depth;
> }
>
> (Note that you could also do a "valued" distance by replacing the calls to
> depth with calls to height, modulo the same arguments as above)
>

> sub lca {
>     my ($t, $a, $b) = @_;
>
>     # this could be made faster with Tie::IxHash
>     # or similar ordered hash logic
>
>     my @a = ($a);
>     push @a, $a while ($a = $a->ancestor);
>     my @b = ($b);
>     push @b, $b while ($b = $b->ancestor);
>     while (@a && @b) {
>         last if $a[0]->internal_id ne $b[0]->internal_id;
>         ($a, $b) = (pop(@a), pop(@b));
>     }
>     return $a; # or $b, doesn't matter
> }
>
> Suggestions for improvement of lca() are welcome (I'd rather not calculate
> the entire transitive closure for this, but a tc() method would be useful
> on other occasions).
>
> On another note, here are some of the tree topology descriptive statistics
> and metrics I'd like to see:
>
> stats on the whole tree:
> ------------------------
> size (# of nodes)
> resolution
> balance
> cic (cladistic information content)
> bandwidth (of the matrix implied by the tree)
>
> node metrics: (only calculated when asked for, then cached until topology changes)
> -------------
> depth (see above)
> height (in any of its incarnations, see above)
> Horton/Strahler index (degree of sub branching)
> flow (contribution of a node to the overall tree)
> bifurcation # (number of children)
>
> And finally, some of the other tree methods I'd like to see:
>
> @nodes = $tree->polytomies(); # returns the roots of any polytomies found
> $tree->is_rooted(); # well is it, or isn't it? Never mind the implicit root.
> $tree->reroot($node); # not just set_root_node, but rearrange topology
>                       # to not lose old root (if present); branches from
>                       # new root to old root need to "invert" direction
> $tree->autoroot(); # like the "retree" program's "root" option.
> $tree->collapse($nodeA => $nodeB); # delete nodeA and give all of it's
>                                    # children to nodeB
>
> Thanks for listening.
>

It all sounds quite useful and straight-forward to implement, let's go
with it.  Of course as you mentioned before it would be nice to start to
attatch sequences (of DNA/Pep or character states) to nodes as well so we
can do other things as well.  Can just derive a Node class which has some
slots for this information.

> -Aaron
>
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu