[BioRuby] Update on phyloXML support for BioRuby project

Christian M Zmasek czmasek at burnham.org
Tue May 19 21:54:18 UTC 2009


Hi, Diana:

I think it is a good idea to have the parser return one tree at a time, 
as opposed to returning a list of trees.
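
To make the one-tree-at-a-time idea concrete, here is a minimal sketch. It is not the BioRuby API and not the libxml2-ruby SAX interface discussed in this thread: it uses REXML's pull parser as a stand-in, and PhylogenyReader and next_tree are hypothetical names. It also assumes that only clades carry <name> elements, which is not true of phyloXML in general.

```ruby
require 'rexml/parsers/pullparser'

# Hypothetical reader (an assumption for illustration, not BioRuby code).
# It consumes XML events only up to the end of the current <phylogeny>,
# so at most one tree's worth of data is collected per call.
class PhylogenyReader
  def initialize(source)
    @parser = REXML::Parsers::PullParser.new(source)
  end

  # Returns the clade names of the next <phylogeny> as an Array,
  # or nil when no further phylogeny is found.
  def next_tree
    names = nil
    in_name = false
    while @parser.has_next?
      event = @parser.pull
      break if event.event_type == :end_document
      if event.start_element?
        names = [] if event[0] == 'phylogeny'
        in_name = true if names && event[0] == 'name'
      elsif event.text?
        names << event[0].strip if in_name
      elsif event.end_element?
        in_name = false if event[0] == 'name'
        return names if event[0] == 'phylogeny'
      end
    end
    nil
  end
end

xml = '<phyloxml>' \
      '<phylogeny><clade><name>A</name></clade></phylogeny>' \
      '<phylogeny><clade><clade><name>B</name></clade>' \
      '<clade><name>C</name></clade></clade></phylogeny>' \
      '</phyloxml>'

reader = PhylogenyReader.new(xml)
while (tree = reader.next_tree)
  puts tree.join(', ')
end
```

A real implementation would build a tree object instead of a flat list of names; the point here is only that the caller pulls trees one by one.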

On the other hand, the same does not apply to nodes. I think it is 
perfectly acceptable to expect to have enough memory to keep at least 
one tree in memory (a good target size might be a binary tree with 
ten-thousand external nodes and 200 bytes of annotation per node, which 
according to my rough calculations would require less than 5MB).
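
The rough calculation can be reproduced as follows. This is a sketch under two assumptions that the text does not state: the tree is a fully resolved, rooted binary tree (so n external nodes imply n - 1 internal nodes), and the 200 bytes apply to internal nodes as well. It counts annotation bytes only and ignores any per-object overhead of the Ruby runtime.

```ruby
# Back-of-the-envelope estimate of annotation storage for one tree.
EXTERNAL_NODES = 10_000
BYTES_PER_NODE = 200

# A fully resolved, rooted binary tree with n external nodes
# has n - 1 internal nodes, i.e. 2n - 1 nodes in total.
def total_nodes(external)
  2 * external - 1
end

# Annotation bytes only; interpreter overhead is not modelled.
def annotation_bytes(external, bytes_per_node)
  total_nodes(external) * bytes_per_node
end

bytes = annotation_bytes(EXTERNAL_NODES, BYTES_PER_NODE)
puts format('%d nodes, %.2f MB', total_nodes(EXTERNAL_NODES), bytes / 1_000_000.0)
# prints "19999 nodes, 4.00 MB"
```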

For your tree use cases, important ones to add are:
* iteration over all nodes
* retrieval/finding of specific nodes according to some criterion (e.g. 
find all nodes for which the species is "E. coli")
* tree reconciliation (e.g. compare a gene tree to a species tree, in 
order to determine duplications on the gene tree)

In any case, all of these applications and algorithms will be most 
time-efficient and easiest to implement with trees that are completely 
in memory.
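
The first two use cases (iterating over all nodes, and finding nodes by a criterion) are simple once the whole tree is in memory. The sketch below uses a hypothetical Node struct with made-up node names and species; it is deliberately not Bio::Tree, which has its own node and edge classes.

```ruby
# Hypothetical minimal node type, for illustration only.
Node = Struct.new(:name, :species, :children) do
  # Depth-first, pre-order traversal of this node and all its descendants.
  # Returns an Enumerator when called without a block.
  def each_node(&block)
    return enum_for(:each_node) unless block_given?

    yield self
    children.each { |child| child.each_node(&block) }
  end
end

# Convenience constructor for an external node (no children).
def leaf(name, species)
  Node.new(name, species, [])
end

tree = Node.new('root', nil, [
  Node.new('n1', nil, [leaf('geneA', 'E. coli'), leaf('geneB', 'B. subtilis')]),
  leaf('geneC', 'E. coli')
])

# Use case 1: iterate over all nodes.
all_names = tree.each_node.map(&:name)

# Use case 2: find all nodes whose species is "E. coli".
e_coli = tree.each_node.select { |node| node.species == 'E. coli' }.map(&:name)

puts all_names.join(' ')
puts e_coli.join(' ')
```

Because each_node returns an Enumerator, any Enumerable method (select, find, count, and so on) can serve as the search criterion.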

Re. "I am a little confused about the require statements in BioRuby 
classes. It looks like bio/tree.rb should hold a general class, but it 
requires bio/db/newick.rb, but this file in turn requires bio/tree.rb."

I am not sure what your question is here. ;)

Christian


Diana Jaunzeikare wrote:
> Hi all,
>
> I want to update you on my thoughts about this project and I have some 
> questions.
>
> So, I think we have reached consensus that the best choice is the 
> libxml2-ruby SAX-based XML parser.
>
> Since BioRuby has a Tree class (http://bioruby.org/rdoc/), it seems 
> logical that the parser should return a Tree object. By using a SAX 
> parser we avoid the problem of having the whole XML file in memory, 
> but phylogenetic trees can still be very large, and it might be too 
> much to store the whole thing as a tree object in memory. This could 
> be mitigated somewhat by having a function next_tree (or 
> next_phylogeny) which would read one tree at a time if the phyloXML 
> file has several of them (this is similar to the BioPerl 
> implementation). I don't think the child nodes can be handled in a 
> similar fashion. Since SAX parses sequentially, to get the next node 
> (a child one level down) in the tree, the whole subtree has to be 
> parsed (waiting for the end-tag event of that child), thus losing 
> speed. Any thoughts on this?
>
> Also, the Tree class should be extended with an output_phyloXML 
> method, since it already has the methods output_newick and output_nhx.
>
> I think that, in order to understand what should be returned after 
> parsing, it would be useful to know how people use phylogenetic tree 
> data. Here are some uses I could come up with:
> * visualize / print
> * calculate total branch length of a tree
> * query info about specific nodes
> * create consensus trees
> Any others?
>
> I am a little confused about the require statements in BioRuby 
> classes. It looks like bio/tree.rb should hold a general class, but it 
> requires bio/db/newick.rb, but this file in turn requires bio/tree.rb.
>
> Thanks,
>
> Diana
>
> Project Page: 
> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:PhyloXML_support_in_BioRuby
>



