[BioRuby] BioRuby Phyloxml update

Naohisa Goto ngoto at gen-info.osaka-u.ac.jp
Sun Nov 8 12:50:56 UTC 2009


Hi Diana,

I'm sorry, but the changes cannot be accepted, because modifying
the existing Bio::Tree methods breaks things.
Bio::Tree is designed not to keep children/parent information
in nodes. One of the reasons is that it is difficult to keep
that information consistent when a tree is copied: nodes can be
shared by two or more trees when a tree is copied with the "dup"
or "clone" method.
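
To illustrate the consistency problem, here is a minimal sketch. SketchNode
and SketchTree are hypothetical stand-ins written only for this example, not
the real Bio::Tree classes; the only assumption carried over is that a copied
tree shares its node objects with the original.

```ruby
# Hypothetical stand-ins, not the real Bio::Tree / Bio::Tree::Node classes.
class SketchNode
  attr_reader :name
  attr_accessor :parent

  def initialize(name)
    @name = name
    @parent = nil
  end
end

class SketchTree
  attr_reader :nodes

  def initialize
    @nodes = []
  end

  # Ruby calls this on the copy made by dup/clone. The node list is
  # duplicated, but the node objects inside it are still shared.
  def initialize_copy(source)
    super
    @nodes = source.nodes.dup
  end

  def add_edge(parent, child)
    @nodes |= [parent, child]
    child.parent = parent   # the relationship is stored in the node itself
  end
end

a = SketchNode.new("A")
b = SketchNode.new("B")
c = SketchNode.new("C")

original = SketchTree.new
original.add_edge(a, b)

copy = original.dup
copy.add_edge(c, b)   # re-attach B under C, in the copy only

# B is the same object in both trees, so the change leaks into the original.
puts original.nodes.include?(c)   # false: C was never added to the original
puts b.parent.name                # "C", although the original only knows A and B
```

In this sketch the original tree never received node C, yet its node B now
reports C as its parent. Keeping the relationships in the tree rather than in
the nodes avoids this kind of leak.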

Normally, tests for existing classes should not be modified
unless the specification changes or the test itself has a bug,
because they guarantee the specification of the class. Adding
new tests is OK.

If you really want each node to hold parent/children information,
please do so only in the PhyloXML classes (though I'm not in favor
of it). In that case, reading PhyloXML data and writing it back
seems fine, but the problem is that there is currently no way to
convert a Bio::Tree to PhyloXML. At present, converting Newick data
to PhyloXML also seems hard.

To prepare your PhyloXML code for inclusion in BioRuby, I'm working
on it in my branch. Some API changes will be made.
http://github.com/ngoto/bioruby/tree/incoming

Note that in your test code, the argument order of assert_equal is
wrong: the expected value should come first, then the actual value.
I've already fixed it in my branch.
http://github.com/ngoto/bioruby/commit/a291af62ef262ee04f3a0e1b6415d4e256c56a94
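
For reference, a minimal Test::Unit sketch of the conventional order. The
test class and the values are made up for illustration and are not taken
from the PhyloXML tests.

```ruby
require "test/unit"

# Hypothetical test case, only to show the argument order.
class AssertEqualOrderTest < Test::Unit::TestCase
  def test_expected_value_comes_first
    computed = [3, 1, 2].sort          # value produced by the code under test
    assert_equal([1, 2, 3], computed)  # expected first, actual second
  end
end
```

With the arguments swapped the test still passes or fails in the same cases,
but the failure message labels the two values the wrong way round.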

> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly.  Have I
> forgotten anything?

Changing root with tree.root=().
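
A small sketch of why the root matters: the parent of a node depends on
which node is the root, so parent/children references stored in nodes go
stale when the root changes, even though no edge is touched. RootedSketch
is a hypothetical minimal class for this example, not the Bio::Tree
implementation.

```ruby
# Hypothetical minimal tree, not the real Bio::Tree implementation.
class RootedSketch
  attr_accessor :root

  def initialize(root)
    @root = root
    @adjacent = Hash.new { |hash, key| hash[key] = [] }
  end

  # Edges are undirected; direction only comes from the choice of root.
  def add_edge(a, b)
    @adjacent[a] << b
    @adjacent[b] << a
  end

  # The parent of a node is its neighbour on the path towards the root,
  # found here by a breadth-first search starting at the current root.
  def parent(node)
    return nil if node == @root
    seen = { @root => nil }   # visited node => node it was reached from
    queue = [@root]
    until queue.empty?
      current = queue.shift
      @adjacent[current].each do |neighbour|
        next if seen.key?(neighbour)
        seen[neighbour] = current
        queue << neighbour
      end
    end
    seen[node]
  end
end

tree = RootedSketch.new(:a)
tree.add_edge(:a, :b)
tree.add_edge(:b, :c)

p tree.parent(:b)   # => :a
tree.root = :c      # no edge is added or removed
p tree.parent(:b)   # => :c
```

Any code that caches parent/children in the nodes would therefore also have
to update every affected node whenever the root is reassigned.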

-- 
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


> Hi all,
> 
> So finally I have updated Bio::Tree and Bio::Node classes to improve
> the phyloxml writer speed.
> 
> * Added Bio::Node::parent and Bio::Node::children (array of nodes) in
> order to avoid calling Tree::parent(node) or Tree::children(node),
> because those methods call breadth-first search on the underlying
> graph, which makes PhyloXML writer and parser incredibly slow. In
> contrast, Bio::Node::parent and Bio::Node::children keep references
> to the respective nodes.
> * Updated Tree::add_edge, Tree::clear_edge, Tree::remove_edge,
> Tree::remove_edge_if, Tree::remove_nonsense_nodes in order to keep
> track of Node::parent and Node::children nodes correctly.  Have I
> forgotten anything?
> * The PhyloXML writer now takes less than 1 second instead of
> ~20 minutes to write the ncbi_taxonomy_mollusca.xml file (1.5 MB)
> * Writing the tree of life taxonomy file (~46 MB) takes 10 seconds
> (on 2.4 GHz, 2.9 GB RAM, running Ubuntu)
> 
> The code is in http://github.com/latvianlinuxgirl/bioruby/tree/tree_class
> 
> I wrote unit tests for my changes and made sure my changes don't break
> anything else. However, does anybody have code lying around that uses
> Tree::parent and Tree::children methods so that I can test it more
> thoroughly?
> 
> Cheers,
> Diana
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby