[Biopython-dev] [Wg-phyloinformatics] GSoC Weekly Update: PhyloXML for Biopython

Wed Jun 17 23:17:41 UTC 2009

Hi Brad,

Here's a mid-week update and partial response to your questions.

*SeqRecord transformation*

It would be nice if I could round-trip this sequence information perfectly, so
that nothing's lost between reading and writing an arbitrary, valid PhyloXML
file. For that to work, PhyloXML.Sequence.from_seqrec() would need to look at
SeqRecord.features and assume that any matching keys have the appropriate
PhyloXML meaning.

These are the keys that from_seqrec() would look for:
    location
    uri
    annotations
    domain_architecture

Do you see any risk of collision for those names? And for serialization,
would it be unwholesome to convert Annotation and DomainArchitecture objects
to a GFF-style dict-in-a-string? e.g. annotation="ref=foo;source=bar;..." --
it's another layer of parsing and kind of esoteric, but I can live with it.
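
To make the dict-in-a-string idea concrete, here is a minimal sketch of the
round trip. The function names encode_attrs and decode_attrs are hypothetical
(not part of Biopython), and percent-encoding the separators is my assumption
about how to keep ";" and "=" inside values from breaking the parse.

```python
from urllib.parse import quote, unquote


def encode_attrs(attrs):
    """Pack a flat dict into a 'key=value;key=value' string (hypothetical helper)."""
    return ";".join(
        f"{quote(str(key), safe='')}={quote(str(value), safe='')}"
        for key, value in attrs.items()
    )


def decode_attrs(text):
    """Reverse encode_attrs; every value comes back as a string."""
    attrs = {}
    for pair in text.split(";"):
        if not pair:
            continue  # tolerate an empty string or a trailing ';'
        key, _, value = pair.partition("=")
        attrs[unquote(key)] = unquote(value)
    return attrs


original = {"ref": "foo", "source": "bar", "desc": "a=b; c"}
packed = encode_attrs(original)
restored = decode_attrs(packed)
```

This only handles flat string values; nested objects such as the domains
inside a DomainArchitecture would need their own flattening convention first.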

*Profiling*

Christian also suggested an option to parse just the phylogenies with a
name or id matching a given string. I like that and I don't see any problem
with extending it to clades as well. It seems like a reasonable use case to
select a sub-tree from a complete phyloXML document and treat it as a separate
phylogeny from then on. This can be supported by various methods for selecting
portions of the tree, and a method on Clade for transforming the selection into
a new Phylogeny instance (so the original can be safely deleted).
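
A rough sketch of name-based selection during parsing, using the standard
library's ElementTree.iterparse rather than the PhyloXML classes themselves.
The function iter_phylogenies and the sample document are made up for
illustration; the only thing taken from the format is the phyloXML namespace
and the phylogeny/name/clade element names.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.phyloxml.org}"


def iter_phylogenies(source, name):
    """Yield <phylogeny> elements whose direct <name> child equals `name`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "phylogeny":
            continue
        if elem.findtext(NS + "name") == name:
            yield elem
        else:
            # Drop the children of phylogenies we don't want to keep.
            elem.clear()


SAMPLE = b"""<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true"><name>alpha</name>
    <clade><name>A</name></clade></phylogeny>
  <phylogeny rooted="true"><name>beta</name>
    <clade><name>B</name></clade></phylogeny>
</phyloxml>"""

matches = [
    phylogeny.findtext(NS + "name")
    for phylogeny in iter_phylogenies(io.BytesIO(SAMPLE), "beta")
]
```

Each phylogeny is still built in full before it is tested, so this bounds
memory by the largest single phylogeny, not by the matching one. Filtering at
the clade level would need the same check applied to clade elements.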

I did some profiling with the cProfile module, and it looks like most of the
time is being spent instantiating Clade and Taxonomy objects. (Also,
pretty_print is hugely inefficient, but that's less important.) I think I can
speed up parsing and reduce memory usage by pulling the from_element methods
out of each class and using a separate Parser class to do that work.
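
For anyone who wants to repeat this kind of measurement, a minimal sketch of
a cProfile run follows. Node and build_tree are stand-ins I made up so the
example is self-contained; they are not the PhyloXML classes, and the real
run would wrap the parser's read() call instead.

```python
import cProfile
import io
import pstats


class Node:
    """Stand-in for a tree node; only here to give the profiler work to do."""

    def __init__(self, name):
        self.name = name
        self.children = []


def build_tree(depth, fanout=3):
    """Recursively build a small tree so object creation shows up in the stats."""
    node = Node(f"n{depth}")
    if depth > 0:
        node.children = [build_tree(depth - 1, fanout) for _ in range(fanout)]
    return node


profiler = cProfile.Profile()
profiler.enable()
build_tree(6)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by "cumulative" shows which call trees dominate; sorting by "tottime"
instead highlights the functions that are slow in their own right, such as a
heavy __init__.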

About the 2GB figure I gave earlier for the full NCBI taxonomy -- I was just
looking at Ubuntu's system monitor, and Firefox and a few other things were
running at the same time, taking up about 800MB already. So the full NCBI
taxonomy actually takes up only 1.2GB or so, which isn't such a problem, and I
think it will get smaller as I shrink down these PhyloXML classes.

Questions:
    - Do you know of a better way to profile Python code, or visualize it?
    - Have you used __slots__ to optimize classes? Do you recommend it?
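
For context on the __slots__ question, this is the kind of change I have in
mind. PlainClade and SlottedClade are illustrative names with made-up
attributes, not the actual PhyloXML classes.

```python
class PlainClade:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, name, branch_length):
        self.name = name
        self.branch_length = branch_length


class SlottedClade:
    """Same attributes, but stored in fixed slots with no per-instance dict."""

    __slots__ = ("name", "branch_length")

    def __init__(self, name, branch_length):
        self.name = name
        self.branch_length = branch_length


plain = PlainClade("A", 0.1)
slotted = SlottedClade("A", 0.1)

# The trade-off: attributes not listed in __slots__ can no longer be added.
try:
    slotted.confidence = 0.9
except AttributeError:
    rejected = True
else:
    rejected = False
```

The saving comes from dropping the per-instance dict, so it should matter
most for the classes instantiated once per node. The cost is that ad hoc
attributes are rejected, and subclasses need to declare __slots__ too or they
get a __dict__ back.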

And a few that don't fit anywhere else:

    - What sort of whole-tree operations would you want to do with these
      objects that you can't do with a Nexus or Newick tree? What other
      formats would you want to convert to? I'm thinking of adding an Export
      module later if there's time, for lossy conversions like a graph for
      networkx.

    - What's the most intuitive way to display a phylogenetic tree you've
      loaded into Biopython? Serialize as Nexus and open in TreeViewX?
      Convert to a graph and send to matplotlib? Or, is there a module in
      Bio.Graphics that can draw trees? (If not, should there be?)

Thanks,
Eric

On Wed, Jun 17, 2009 at 8:41 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Hi Eric;
> Nice update and thanks again for copying the Biopython development
> list on this.
>
> >  * Added to_seqrecord and from_seqrecord methods to the
> >    PhyloXML.Sequence class -- getting Bio.SeqRecord to stand in for
> >    PhyloXML.Sequence entirely will require some more thought
>
> I'm looking forward to seeing how you decide to go forward with
> this. For the work I do on a day to day basis, a continual
> struggle involves establishing relationships between things to
> retrieve more information. For instance, a pair of nodes on a tree
> is interesting -- how would I find papers, experiments and other
> information associated with those sequences? It seems like Accession
> and the ref attribute of Annotation help establish these
> relationships.
>
> >  * Test-driven development kind of went out the window this week.
>
> Heh. It happens -- sounds sensible to have a clean up and
> documentation week this week; that will also help others who are
> interested dig into using it.
>
> >  * The unit tests I do have in place give some sense of memory and CPU
> >    usage. For the full NCBI taxonomy, memory usage climbs up above 2 GB
> >    with the read() function, which isn't a problem on this workstation
> >    but could be for others.
>
> Do you see an opportunity to offer iterating over clades instead of
> loading them all into memory for these larger trees? This would
> involve lazily loading subclades on request and would limit some
> functionality for querying the full tree without loading it all into
> memory.
>
> Another option is to offer some pruning ability as a tree is
> loading. For instance, if I am loading the whole NCBI taxonomy on a
> memory limited computer and only need the Angiosperm flowering plant
> part of the tree. In this case, you'd want to throw away all clades
> not under the clades of interest.
>
> These are probably fringe cases; just brainstorming some ideas.
>
> Thanks again,
> Brad
>