[Biopython-dev] Bio.Phylo: the home stretch

Eric Talevich eric.talevich at gmail.com
Sat Apr 17 13:35:57 UTC 2010


Hi all,

There are two more decisions in Bio.Phylo that I'd like to settle on before
the release of Biopython 1.54. They're holding open Bug 3045:
http://bugzilla.open-bio.org/show_bug.cgi?id=3045


1. *Do we need a get_all_clades() method on trees and clades?*

Bio.Nexus has get_terminals(); I added the same to Bio.Phylo early on, and
then get_nonterminals() to satisfy some demand for the opposite method:

    def get_terminals(self, order='preorder'):
        """Get a list of all of this tree's terminal (leaf) nodes."""
        return list(self.find_clades(terminal=True, order=order))

    def get_nonterminals(self, order='preorder'):
        """Get a list of all of this tree's nonterminal (internal) nodes."""
        return list(self.find_clades(terminal=False, order=order))

They're both trivial, but the idea is to make the module easy to jump into
without reading the docs first. (find_clades() is a generator function that
several other functions use internally; to do useful things in Bio.Phylo you
still need to learn how to use it eventually.)

So (a) do we need yet another sugar function that retrieves all tree nodes,
both internal and external? (b) if so, what should it be called?

The implementation would be:    list(self.find_clades(order=order))
The same set of nodes, though not in the same traversal order, is also
available as:    tree.get_terminals() + tree.get_nonterminals()
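
To make the question concrete, here is a minimal, self-contained sketch of
how the three helpers would sit on top of find_clades(). The Clade class
below is a toy stand-in written for this example, not the real Bio.Phylo
class; it handles only the 'preorder' and 'postorder' orders, and
get_all_clades() is just the candidate name under discussion.

```python
class Clade(object):
    """Toy stand-in for a Bio.Phylo clade (assumption: not the real class)."""

    def __init__(self, name=None, clades=None):
        self.name = name
        self.clades = clades or []

    def is_terminal(self):
        return not self.clades

    def find_clades(self, terminal=None, order="preorder"):
        """Yield clades lazily; terminal=None means no filtering."""
        if order not in ("preorder", "postorder"):
            raise ValueError("unsupported order in this sketch: %r" % order)
        matches = terminal is None or terminal == self.is_terminal()
        if order == "preorder" and matches:
            yield self
        for child in self.clades:
            for sub in child.find_clades(terminal=terminal, order=order):
                yield sub
        if order == "postorder" and matches:
            yield self

    def get_terminals(self, order="preorder"):
        return list(self.find_clades(terminal=True, order=order))

    def get_nonterminals(self, order="preorder"):
        return list(self.find_clades(terminal=False, order=order))

    def get_all_clades(self, order="preorder"):
        """Hypothetical sugar method; the name is still undecided."""
        return list(self.find_clades(order=order))


# Toy tree: ((A, B)X, C)root
tree = Clade("root", [Clade("X", [Clade("A"), Clade("B")]), Clade("C")])

all_names = [c.name for c in tree.get_all_clades()]
concat_names = [c.name
                for c in tree.get_terminals() + tree.get_nonterminals()]
```

In this sketch both lists contain the same five clades, but the
concatenation puts every terminal before every nonterminal, so it does not
preserve the preorder traversal that a single find_clades() call gives.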



2. *Rename find_clades() to find(), or something else?*

I've previously renamed:

find() => find_any()
-- given the same parameters as find_clades(), return the first match found,
or else None (useful in an if statement)

find_all() => find_elements()
-- phyloXML trees have some complex objects as tree attributes, containing
other objects. This function searches for those objects directly. For trees
without such attributes (e.g. all Newick trees), it happens to return the
same results as find_clades()

So: find_clades() can search inside complex objects attached to trees, but
yields the corresponding clade object rather than the non-clade element
itself. This lets you search clades by e.g. clade.taxonomy.scientific_name,
or clade.sequence.type. It should be the first "find_*" function users reach
for. Should we give it a shorter name to encourage that, and shorten the
code that uses it?
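
The difference between the three functions can be sketched with toy
classes. Everything below is a simplified illustration, not the Bio.Phylo
implementation: the Taxonomy and Clade classes, the module-level function
form, and the keyword-matching rule are assumptions made for the example,
and find_any() is assumed here to return the first matching clade.

```python
_MISSING = object()  # sentinel so that a missing attribute never matches


class Taxonomy(object):
    """Toy non-clade element attached to a clade."""

    def __init__(self, scientific_name):
        self.scientific_name = scientific_name


class Clade(object):
    """Toy clade with an optional attached Taxonomy."""

    def __init__(self, name=None, taxonomy=None, clades=None):
        self.name = name
        self.taxonomy = taxonomy
        self.clades = clades or []


def _matches(element, attrs):
    """True if every keyword equals the element's attribute of that name."""
    return all(getattr(element, key, _MISSING) == value
               for key, value in attrs.items())


def _walk(clade):
    """Preorder traversal over clades."""
    yield clade
    for child in clade.clades:
        for sub in _walk(child):
            yield sub


def find_elements(root, **attrs):
    """Yield the matching objects themselves (clades or taxonomies)."""
    for clade in _walk(root):
        for element in (clade, clade.taxonomy):
            if element is not None and _matches(element, attrs):
                yield element


def find_clades(root, **attrs):
    """Yield each clade that matches or holds a matching element."""
    for clade in _walk(root):
        elements = (clade, clade.taxonomy)
        if any(e is not None and _matches(e, attrs) for e in elements):
            yield clade


def find_any(root, **attrs):
    """Return the first match from find_clades(), or None."""
    return next(find_clades(root, **attrs), None)


tree = Clade("root", clades=[
    Clade("A", Taxonomy("Homo sapiens")),
    Clade("B", Taxonomy("Mus musculus")),
])

hits = list(find_clades(tree, scientific_name="Mus musculus"))
elements = list(find_elements(tree, scientific_name="Mus musculus"))
missing = find_any(tree, scientific_name="Danio rerio")
```

Here hits holds the clade named "B", elements holds the Taxonomy object
attached to it, and missing is None, which is what makes find_any()
convenient in an if statement.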


Here's a first crack at documentation:
http://github.com/etal/biopython/commit/8056a198804a08e3e03ac943c45744ad020dd53f


Thanks,
Eric


