[BioRuby] BioRuby Phyloxml update

Diana Jaunzeikare djaunzei at smith.edu
Sun Nov 8 03:50:26 UTC 2009


Hi all,

I have finally updated the Bio::Tree and Bio::Node classes to improve
the speed of the PhyloXML writer.

* Added Bio::Node::parent and Bio::Node::children (array of nodes) in
order to avoid calling Tree::parent(node) or Tree::children(node),
because those methods run a breadth-first search on the underlying
graph, which makes the PhyloXML writer and parser incredibly slow. In
contrast, Bio::Node::parent and Bio::Node::children keep references
to the respective nodes.
* Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
Tree::remove_edge_if, and Tree::remove_nonsense_nodes in order to keep
track of the Node::parent and Node::children references correctly. Have I
forgotten anything?
* The PhyloXML writer now takes less than 1 second, instead of
~20 minutes, to write the ncbi_taxonomy_mollusca.xml file (1.5 MB).
* Writing the tree of life taxonomy file (~46 MB) takes 10 seconds
(on a 2.4 GHz machine with 2.9 GB RAM, running Ubuntu).
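
The idea behind the first two points can be illustrated with a minimal
sketch. This is not the actual BioRuby code: SketchNode and SketchTree
are hypothetical names, and the sketch assumes only that each node caches
its parent and children and that the edge-editing methods keep those
caches in sync, so that lookups need no graph search.

```ruby
# Hypothetical illustration only; these classes are not part of BioRuby.
class SketchNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil     # cached reference to the parent node
    @children = []    # cached references to the child nodes
  end
end

class SketchTree
  # Link child under parent and update the cached references on both nodes.
  def add_edge(parent, child)
    # A node has at most one parent, so detach it from any previous one.
    remove_edge(child.parent, child) if child.parent
    child.parent = parent
    parent.children << child
    self
  end

  # Unlink child from parent and clear the cached references.
  def remove_edge(parent, child)
    parent.children.delete(child)
    child.parent = nil if child.parent.equal?(parent)
    self
  end
end

tree = SketchTree.new
root = SketchNode.new("root")
a = SketchNode.new("a")
b = SketchNode.new("b")
tree.add_edge(root, a)
tree.add_edge(root, b)
tree.remove_edge(root, b)

# Parent and children lookups are now plain attribute reads.
puts a.parent.name
puts root.children.map(&:name).inspect
```

The trade-off is that every method that changes the edges has to update
the cached references too, which is why the edge-editing methods listed
above needed changes.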

The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class

I wrote unit tests for my changes and made sure they don't break
anything else. However, does anybody have code lying around that uses
the Tree::parent and Tree::children methods, so that I can test it more
thoroughly?

Cheers,
Diana


