[Biopython-dev] Code review request for phyloxml branch

Thu Sep 24 03:48:49 UTC 2009

Folks,

I've fixed a couple of remaining issues in the Bio.Tree and Bio.TreeIO
modules and I'd like your opinion on what else should be done before merging
this into the mainline.

First, the wiki documentation for PhyloXML has an example pipeline showing
how to build a phylogeny in Biopython, from a raw protein sequence to a
lightly annotated phyloXML file.
http://biopython.org/wiki/PhyloXML#Example_pipeline

Does this look right? I copied the first few steps from the official
docs.

The source code, for your review, is here:
http://github.com/etal/biopython/tree/phyloxml/Bio/Tree/
http://github.com/etal/biopython/tree/phyloxml/Bio/TreeIO/
http://github.com/etal/biopython/tree/phyloxml/Tests/test_PhyloXML.py

Discussion:

*TreeIO*
The read, parse, write and convert functions work essentially the same as in
SeqIO and AlignIO, for the formats 'newick', 'nexus' and 'phyloxml'. Issues:

(1) 'phyloxml' uses a different object representation than the other two, so
converting between those formats is not possible until Nexus.Trees is ported
over to Bio.Tree.

(2) NexusIO.write() just doesn't seem to work. I don't understand how to
make the original Nexus module write out trees that it didn't parse itself.
Help?
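
For reference, the SeqIO-style contract mentioned above (parse() yields
records, read() insists on exactly one, convert() chains parse() into
write()) can be sketched as below. This is only an illustration under
assumptions: the toy "parser" treats each non-blank line as one tree, and
none of these function bodies are taken from the TreeIO code in the branch.

```python
import io


def parse(handle, format):
    """Toy stand-in parser: yield one 'tree' (a raw string) per non-blank line."""
    if format != "newick":
        raise ValueError("Unknown format: %r" % format)
    for line in handle:
        line = line.strip()
        if line:
            yield line


def read(handle, format):
    """Return exactly one tree; raise ValueError for zero or several."""
    trees = parse(handle, format)
    try:
        first = next(trees)
    except StopIteration:
        raise ValueError("No trees found in handle")
    try:
        next(trees)
    except StopIteration:
        return first
    raise ValueError("More than one tree found in handle")


def write(trees, handle, format):
    """Write each tree on its own line; return the number written."""
    if format != "newick":
        raise ValueError("Unknown format: %r" % format)
    count = 0
    for tree in trees:
        handle.write(tree + "\n")
        count += 1
    return count


def convert(in_handle, in_format, out_handle, out_format):
    """Chain parse() into write(), as SeqIO.convert() does for sequences."""
    return write(parse(in_handle, in_format), out_handle, out_format)
```

With this layout, converting between two formats only works if both
parsers and writers agree on the object they pass around, which is the
limitation described in issue (1).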

*Tree*
The BaseTree module is meant to be the basis for Newick trees eventually,
so I'd like to get the design right with the minimum number of public
methods:

(1) The find() function, named after the Unix utility that does the same
thing for directory trees, seems capable of all the iteration and filtering
necessary for locating data and automatically adding annotations to a tree.
There's a 'terminal' argument for selecting internal nodes, external nodes,
or both, and I think this means get_leaf_nodes() is unnecessary. I'm going
to remove it if no one protests.
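
To make the idea concrete, here is a minimal sketch of a find() method
with a 'terminal' filter. The Node class, the True/False/None meaning of
'terminal', and the keyword-attribute matching are all assumptions made
for illustration; the actual signature in the branch may differ.

```python
class Node:
    """Hypothetical minimal tree node, not the BaseTree class from the branch."""

    def __init__(self, name=None, children=None):
        self.name = name
        self.children = list(children or [])

    def is_terminal(self):
        return not self.children

    def find(self, terminal=None, **attrs):
        """Yield matching nodes in depth-first (preorder) order.

        terminal: True for leaves only, False for internal nodes only,
                  None for both (assumed semantics).
        attrs:    attribute=value pairs that a node must match.
        """
        stack = [self]
        while stack:
            node = stack.pop()
            if terminal is None or node.is_terminal() == terminal:
                if all(getattr(node, key, None) == value
                       for key, value in attrs.items()):
                    yield node
            # Push children reversed so the leftmost child is visited first.
            stack.extend(reversed(node.children))


root = Node("root", [Node("AB", [Node("A"), Node("B")]), Node("C")])
leaf_names = [node.name for node in root.find(terminal=True)]
internal_names = [node.name for node in root.find(terminal=False)]
```

Under these assumptions, find(terminal=True) covers what get_leaf_nodes()
would return, so a separate method adds nothing.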

(2) Should find() be based on depth_first_search or breadth_first_search
(not checked in yet)? DFS would potentially find a leaf node faster, but BFS
seems more common in phylogenetics. Note that iteration can easily be
reversed with the standard reversed() function, so we don't need extra
functions for those cases.
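
The difference in visiting order can be seen in a small sketch. The two
generator functions and the dictionary-based toy tree are illustrative
assumptions, not the functions in the branch. One caveat: reversed() only
accepts sequences, so a generator result has to be passed through list()
first.

```python
from collections import deque

# Toy tree as a parent -> children mapping (hypothetical example data).
CHILDREN = {"root": ["AB", "C"], "AB": ["A", "B"]}


def get_children(node):
    return CHILDREN.get(node, [])


def depth_first_search(root):
    """Preorder traversal: follow each branch to its tips before moving on."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        # Reverse so the leftmost child is popped (visited) first.
        stack.extend(reversed(get_children(node)))


def breadth_first_search(root):
    """Level-order traversal: visit all nodes at one depth before the next."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(get_children(node))


dfs_order = list(depth_first_search("root"))
bfs_order = list(breadth_first_search("root"))
# reversed() needs a sequence, hence the list() around the generator.
reversed_bfs = list(reversed(list(breadth_first_search("root"))))
```

On this tree DFS reaches the first leaf ("A") on its third step, while BFS
visits every node at depth one before any leaf below "AB".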

(3) I left room in each Node for the left and right indexes used by BioSQL's
nested-set representation. Now I'm doubting the utility of that -- any
Biopython function that uses those indexes would need to ensure that the
index is up to date, which seems tricky. Shall I remove all mention of the
nested-set representation, or try to support it fully?
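
For anyone unfamiliar with the nested-set scheme, the sketch below shows
how left/right indexes are assigned and why they go stale. The Node class
and both helper functions are hypothetical and written only for this
example; they are not the BioSQL or Bio.Tree code.

```python
class Node:
    """Hypothetical node carrying nested-set indexes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = list(children or [])
        self.left = None
        self.right = None


def assign_nested_set_indexes(node, counter=1):
    """Number nodes so each (left, right) interval encloses its descendants'.

    Returns the next unused index.
    """
    node.left = counter
    counter += 1
    for child in node.children:
        counter = assign_nested_set_indexes(child, counter)
    node.right = counter
    return counter + 1


def is_ancestor(a, b):
    """True if a is a strict ancestor of b, judged by the indexes alone."""
    return a.left < b.left and b.right < a.right


a, b, c = Node("A"), Node("B"), Node("C")
ab = Node("AB", [a, b])
root = Node("root", [ab, c])
assign_nested_set_indexes(root)

# Editing the tree invalidates the stored indexes until they are recomputed.
d = Node("D")
c.children.append(d)
stale = d.left is None
assign_nested_set_indexes(root)
```

The ancestor test is a cheap integer comparison, but every structural edit
requires renumbering, which is the maintenance burden in question.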

(4) There's some mention in the literature of a relationship-matrix
representation for phylogenies. Does anyone here know how to work with this
representation, or know if it would let us perform complex calculations with
blinding speed behind the scenes? If so, should there be a function in
Bio.Tree.Utils to export a tree to a NumPy array represented this way? If
not, I'll forget about it.

*Graphics*
I finally fixed the networkx/graphviz/matplotlib drawing to leave unlabeled
nodes inconspicuous, so the resulting graphic is much cleaner, perhaps even
usable. Plus, the nodes are now a pretty shade of blue. Still, it would be
nice to have a Reportlab-based module in Bio.Graphics to print phylogenies
in the way biologists are used to seeing them. Does anyone know of existing
code that could be borrowed for this? I looked at ETE (announced on the main
biopython list last week) and liked the examples, but it uses PyQt4 and a
standalone GUI for display, which is a substantial departure from the
Biopython way of doing things.

Best regards,
Eric