[Biopython-dev] GSoC Weekly Update 10: PhyloXML for Biopython

Eric Talevich eric.talevich at gmail.com
Mon Jul 27 17:56:40 UTC 2009


Hi folks,

Previously (July 20-24) I:

    Finished implementing I/O methods, Tree classes and tests for all
    phyloXML elements.

    Changed Writer to preserve node order in the XML; output now validates
    under the phyloXML 1.00 schema (but 1.10 complains)

    Did some drastic code reorganization.
    - Bio.Tree:
        - Moved Clade.find() and PhyloElement.__repr__ methods to BaseTree
          classes
        - Made Clade inherit from BaseTree.Tree in addition to
          BaseTree.Node, and added the corresponding attributes
        - Moved Bio.PhyloXML.Tree to Bio.Tree.PhyloXML

    - Bio.TreeIO:
        - Merged PhyloXML's Parser and Writer into PhyloXMLIO under the new
          Bio.TreeIO module, and updated imports everywhere
        - Added wrappers for Nexus read/write; these don't return Bio.Tree
          objects yet, though

    Added/updated unit tests for all of this.

    Documented the code reorg on the Biopython wiki, adding Tree and TreeIO
    pages and fixing the examples on the PhyloXML page.

    Scrubbed docstrings and enabled epydoc processing.


This week (July 27-31) I will:

    Finish implementing the phyloXML spec:

    - Scan "simple types" for restricted tokens; check strings in
      constructors
    - Take a stab at phyloXML 1.10 support (need a 'version' arg to Writer?)
    - Clean up and reorganize any code that needs it

    Enhancements (time permitting):

    - Improve the SeqRecord conversion
    - Work on Bio.Tree.BaseTree compatibility with BioSQL's PhyloDB
      extension
    - Port common methods to Bio.Tree.BaseTree -- see Bio.Nexus.Tree,
      Bioperl node objects, PyCogent, p4-phylogenetics
    - Tree method: build_index (set left_idx, right_idx on all nodes):
        - calculate left/right indexes for nested-set representation
        - see http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html

    - Export to networkx (http://networkx.lanl.gov/) -- also get graphviz
      export for free, via networkx.to_agraph()
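
    For the build_index item above, here is a rough sketch of the nested-set
    numbering I have in mind. The Node class is a hypothetical stand-in, not
    the actual Bio.Tree.BaseTree API, and is_descendant is just an
    illustration of why the indexes are useful: a node lies under an ancestor
    exactly when its (left_idx, right_idx) interval nests inside the
    ancestor's.

```python
class Node:
    """Hypothetical stand-in for a tree node; not the Bio.Tree.BaseTree API."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []
        self.left_idx = None
        self.right_idx = None


def build_index(root):
    """Assign nested-set left/right indexes in one depth-first traversal."""
    counter = 1
    # Each stack entry is (node, visited). visited=False means we are
    # entering the node; visited=True means all its children are done.
    stack = [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            node.right_idx = counter
        else:
            node.left_idx = counter
            stack.append((node, True))
            # Push children in reverse so they are numbered left to right.
            for child in reversed(node.children):
                stack.append((child, False))
        counter += 1
    return root


def is_descendant(node, ancestor):
    """True if node sits strictly inside ancestor's left/right interval."""
    return (ancestor.left_idx < node.left_idx
            and node.right_idx < ancestor.right_idx)


# Small example tree: root -> (A, inner -> (B, C))
leaf_a, leaf_b, leaf_c = Node("A"), Node("B"), Node("C")
inner = Node("inner", [leaf_b, leaf_c])
root = build_index(Node("root", [leaf_a, inner]))
```

    The explicit stack avoids hitting Python's recursion limit on very deep
    trees; a recursive version would be shorter but otherwise equivalent.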


Remarks:

    - Bioperl's phyloXML driver was written for version 1.00 and might hurl
      if given a v1.10 file -- so that's a potential problem if Biopython
      defaults to writing v1.10 files. Should Writer take an option to
      specify the file format version number? Right now it only writes
      valid phyloXML v1.00.

    - PhyloXMLIO also always writes branch_length as a child XML element,
      not an attribute. This validates and will be handled safely by any
      sane parser, and fits better with the idea of an implicit root node
      in each clade object, I think. (The parser still handles an attribute
      properly.) Any objections?

    - Above, I've listed more enhancements than I'll probably be able to
      finish this week. Which should have higher priority? I know merging
      Bio.Nexus and Bio.Tree would be the most useful, but since (1)
      Biopython development still happens on CVS, not Git, and (2) another
      Tree-based GSoC project is expected to land around the same time as
      mine, I think doing the integration right now would be kind of
      painful. So I can either focus on laying the groundwork in
      Bio.Tree.BaseTree, copying rather than moving the relevant Nexus
      code, or else work mainly on exporting to other useful object
      representations like networkx graphs, or any Biopython classes I've
      missed (e.g. alignments). Suggestions?
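
    To make the branch_length remark above concrete, here is a minimal
    sketch of reading both forms with the standard library's ElementTree.
    This is not the PhyloXMLIO code: get_branch_length is a made-up helper,
    and the snippets leave out the phyloXML namespace that real files carry,
    purely to keep the example short.

```python
import xml.etree.ElementTree as ET


def get_branch_length(clade):
    """Return a clade's branch length as a float, or None if absent.

    Hypothetical helper: checks the child element first, then falls
    back to the attribute form.
    """
    child = clade.find("branch_length")
    if child is not None and child.text:
        return float(child.text)
    attr = clade.get("branch_length")
    return float(attr) if attr is not None else None


# The same branch length written both ways (namespace omitted for brevity).
as_element = ET.fromstring(
    "<clade><branch_length>0.5</branch_length></clade>")
as_attribute = ET.fromstring('<clade branch_length="0.5"/>')
no_length = ET.fromstring("<clade/>")
```

    A reader written along these lines gets the same value from either
    form, which is why always writing the element form should be safe.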


Cheers,
Eric
http://github.com/etal/biopython/tree/phyloxml/Bio/PhyloXML
https://www.nescent.org/wg_phyloinformatics/PhyloSoC:Biopython_support_for_parsing_and_writing_phyloXML


