[BioRuby] GSOC PhyloXML profiling, bottleneck is Bio::Tree#children

Naohisa GOTO ngoto at gen-info.osaka-u.ac.jp
Mon Aug 10 02:58:12 UTC 2009


Hi,

On Fri, 7 Aug 2009 13:31:45 -0400
Diana Jaunzeikare <rozziite at gmail.com> wrote:

> Hi all,
> 
> Here is update on Google Summer of Code Bioruby PhyloXML project. I was
> profiling and refactoring Bioruby PhyloXML Parser code and got 67% speed
> increase.
> 
> With profiling PhyloXML Writer the story is different. It takes 24minutes to
> write the 1.5MB mollusca taxonomy tree and forever other larger files.
> Again the bottleneck is bfs_shortest_path, which is called from
> Tree#children method. It takes forever to just iterate over all the children
> nodes.

Bio::Tree uses graph algorithm  implemented in the Bio::Pathway class.
As you say, their implementations are naive and should be rewritten
in the future.

> To solve this I propose to save an array of the children of the node within
> my PhyloXML::Node (which corresponds to a clade) class. This would also
> ensure that when a phyloxml file is parsed and then written back, clades
> would be the same order in the input and output files.

How to maintain the array when modifying the tree, e.g. to add,
delete, or replace some nodes and edges?


Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org


> Have a good weekend,
> 
> Diana
> 



More information about the BioRuby mailing list