[Biopython-dev] [Wg-phyloinformatics] BioGeography update

Eric Talevich eric.talevich at gmail.com
Wed Jul 8 04:09:43 UTC 2009

On Tue, Jul 7, 2009 at 11:12 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> Eric wrote:
> > My impression is that most tree representations are based on a recursive
> > Node element with a few associated attributes and a number of useful
> > methods; phyloXML has a Clade object roughly corresponding to that,
> > but also a bunch of other element types for extensive annotation of
> > the tree. So two options spring to mind:
> >
> > 1. Let the Bio.PhyloXML.Tree objects be a superset of everything needed
> > by any phylogenetic tree representation, ever. (It's already pretty
> > close.) Refactor Nexus and Newick to use these objects; merge the
> > features of lagrange so the rest of the Biopython environment can
> > benefit. Only export to external object structures that are something
> > other than a straight phylogenetic tree -- e.g. networkx or graphviz for
> > plotting, numpy/scipy for crunching.
> >
> > 2. Factor a simple tree structure out of lagrange and Bio.Nexus, and let
> > that be the Biopython default representation. Add a function in
> > Bio.PhyloXML to export its enhanced tree structure to this simpler
> > Bio.Tree representation.
>
> I am unclear why you would need to have an entirely separate tree
> object structure (which then requires code to map between the two).
> Perhaps some specific examples of the "enhancements" would help?

The benefit of letting the tree object structures diverge is procrastination
-- we could reconcile the two modules after GSoC is over, with stable
features and test suites in place. But I could justifiably focus on
integration for the remaining weeks if that's best for Biopython, since
otherwise I'd probably be reimplementing a number of features already
present in other modules.

> How about this variation on (2):
> Suppose Bio.Tree provided a simple tree object (holding a nested
> structure), with methods/functions for general operations like DFT,
> finding common ancestors, calculating branch lengths, collapsing internal
> nodes, etc. [and I would expect a lot of this could be borrowed from
> Bio.Nexus, and/or Thomas Mailund's Newick module]. Couldn't Bio.PhyloXML
> build on this using subclassed tree nodes?

The Bioperl and Bioruby phyloXML projects were done this way, I think, but
they already had access to Tree/Node objects within each project.
Bio.PhyloXML.Tree objects could inherit from Bio.Tree objects if the
Bio.Tree objects were designed in a compatible way... if we go this route
I'll need to draft up a list of traps, like naming conventions
("annotations" is already an attribute of Bio.PhyloXML.Sequence) and class
hierarchy (some functions rely on everything in the phyloXML spec being a
subclass of PhyloElement).
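
As a rough sketch of the subclassing route, assuming a shared base node: the names TreeNode, PhyloXMLClade, depth_first, path_to and common_ancestor below are hypothetical and are not existing Biopython API. The base class carries only topology and the generic operations; the phyloXML-flavoured subclass adds annotation slots on top.

```python
class TreeNode:
    """Hypothetical minimal node for a shared Bio.Tree module."""

    def __init__(self, name=None, branch_length=None, children=None):
        self.name = name
        self.branch_length = branch_length
        self.children = list(children) if children else []

    def depth_first(self):
        """Yield this node and all of its descendants in preorder."""
        yield self
        for child in self.children:
            yield from child.depth_first()

    def path_to(self, target):
        """Return the list of nodes from self down to target, or None."""
        if self is target:
            return [self]
        for child in self.children:
            sub_path = child.path_to(target)
            if sub_path is not None:
                return [self] + sub_path
        return None

    def common_ancestor(self, a, b):
        """Return the deepest node lying on the paths to both a and b."""
        path_a, path_b = self.path_to(a), self.path_to(b)
        if path_a is None or path_b is None:
            return None
        ancestor = None
        for x, y in zip(path_a, path_b):
            if x is not y:
                break
            ancestor = x
        return ancestor

    def total_branch_length(self):
        """Sum branch lengths over the subtree, treating None as 0."""
        return sum(node.branch_length or 0.0 for node in self.depth_first())


class PhyloXMLClade(TreeNode):
    """Hypothetical phyloXML-style subclass that adds annotation slots."""

    def __init__(self, name=None, branch_length=None, children=None,
                 confidences=None, taxonomies=None):
        super().__init__(name, branch_length, children)
        self.confidences = confidences or []
        self.taxonomies = taxonomies or []


a = PhyloXMLClade(name="A", branch_length=0.1)
b = PhyloXMLClade(name="B", branch_length=0.2)
c = PhyloXMLClade(name="C", branch_length=0.4)
inner = PhyloXMLClade(name="AB", branch_length=0.3, children=[a, b])
root = PhyloXMLClade(name="root", children=[inner, c])

names = [node.name for node in root.depth_first()]
lca = root.common_ancestor(a, b)
```

The generic operations never touch the annotation attributes, which is what would let the Nexus and Newick code share them; the naming and class-hierarchy traps mentioned above would still have to be settled in the base class first.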

> Do we even need different objects? What if each node class had an optional
> python dictionary for annotations? You could maybe key this off the
> PhyloXML names?

I bet this could be done without different objects. Bio.PhyloXML.Tree could
be moved to Bio.Tree or Bio.Tree.Elements; the base class PhyloElement could
be renamed to TreeElement; and the Nexus and Newick parsers could reuse
PhyloXML's Phylogeny and Clade elements, where Clade merges with the
existing Node class(es). Even Clade by itself might be enough. For
organizational purposes, format-specific tree elements could move to their
own files (Bio.Tree.PhyloElement.py, Bio.Tree.NexusElement.py), or some
multiple-inheritance tricks could be used to smooth things over.
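
To illustrate the idea of another format's parser reusing one clade class, here is a minimal sketch. The Clade class is a heavily reduced stand-in for the real one, and parse_newick is a hypothetical helper written for this example (it is not the Bio.Nexus parser); it handles only plain Newick with no quoting or comments.

```python
class Clade:
    """Reduced stand-in for the shared clade class (hypothetical)."""

    def __init__(self, name=None, branch_length=None, clades=None):
        self.name = name
        self.branch_length = branch_length
        self.clades = clades or []


def parse_newick(text):
    """Parse a simple Newick string into nested Clade objects."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_clade():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_clade())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_clade())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # skip ")"
        # Read the optional "label:length" part up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        return Clade(name=label.strip() or None,
                     branch_length=float(length) if length else None,
                     clades=children)

    root = parse_clade()
    if pos != len(text):
        raise ValueError("unexpected text at position %d" % pos)
    return root


tree = parse_newick("((A:0.1,B:0.2):0.3,C:0.4);")
```

The parser only needs name, branch_length and the list of child clades, so every other attribute on the shared class can keep its default.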

Here is the phyloXML definition of Clade:

My implementation (trimmed):

class Clade(PhyloElement):
    """Describes a branch of the current phylogenetic tree.

    Used recursively, describes the topology of a phylogenetic tree.

    The parent branch length of a clade can be described with the
    'branch_length' attribute.

    Element 'confidence' is used to indicate the support for a clade/parent
    branch.

    Element 'events' is used to describe such events as gene-duplications at
    the root node/parent branch of a clade.

    Element 'width' is the branch width for this clade (including parent
    branch). Both 'color' and 'width' elements apply for the whole clade
    unless overwritten in sub-clades.
    """
    def __init__(self, branch_length=None, id_source=None,
            name=None, width=None, color=None, node_id=None, events=None,
            binary_characters=None, date=None,
            # Collections
            confidences=None, taxonomies=None, sequences=None,
            distributions=None, references=None, properties=None,
            ):
        # set all keyword arguments to instance attributes; collections
        # default to [] ...

The same for Phylogeny:

class Phylogeny(PhyloElement):
    """A phylogenetic tree."""
    def __init__(self, rooted,
            rerootable=None, branch_length_unit=None, type=None,
            name=None, id=None, description=None, date=None, clade=None,
            # Collections
            confidences=None, clade_relations=None, sequence_relations=None,
            properties=None, other=None,
            ):
        assert isinstance(rooted, bool)
        # set keyword arguments to attributes; collections default to [] ...
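
The elided "set keyword arguments to attributes; collections default to []" step could look roughly like the sketch below. It uses a cut-down Clade with only four of the arguments, and the empty PhyloElement base is a placeholder, so treat it as an illustration of the pattern rather than the actual implementation.

```python
class PhyloElement:
    """Placeholder for the real base class."""


class Clade(PhyloElement):
    def __init__(self, branch_length=None, name=None,
                 # Collections
                 confidences=None, taxonomies=None):
        self.branch_length = branch_length
        self.name = name
        # None in the signature, a fresh list here: a literal [] default
        # would be created once and shared by every instance.
        self.confidences = confidences if confidences is not None else []
        self.taxonomies = taxonomies if taxonomies is not None else []


first = Clade(name="A", branch_length=0.1)
second = Clade(name="B")
first.confidences.append(0.95)
```

After the append, second.confidences is still empty because each instance owns its own list.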


If we base the Bio.Tree objects off of these two classes, then I wouldn't
even need an optional annotations dictionary on each object. Which makes
sense, since I think the phyloXML format was designed to accommodate nearly
all types of annotations that could reasonably be applied to phylogenetic
trees. Assuming most of the Newick and Nexus annotations fit into this
design, if a small number of annotations don't, they can be added to this
constructor as more keyword arguments without much harm. (I know nothing
about NeXML; should we keep an eye on that too? Glancing at the homepage, I
don't see much about complex annotation types, which is probably good if we
want to fit that format into this framework eventually.)

