[BioRuby] GSOC: phyloXML for BioRuby: Profiling

Diana Jaunzeikare rozziite at gmail.com
Sun Jun 28 19:55:44 UTC 2009


Hi all,

I did some profiling of the code.

My system is Ubuntu 9.04, ruby 1.8.7 [i486-linux], Intel Core 2 Duo P8600
@2.4GHz

I created test_phyloxml_big.rb test file. It has test_next_tree method which
calls next_tree on the phyloxml file until end of file is reached. Here
follow results on the ncbi_taxonomy_mollusca.xml file which is 1.5MB large
with 5632 external nodes.

It takes around 7.5min to finish test_phyloxml_big.rb test.
(Finished in 443.231507 seconds.
Finished in 457.255576 seconds. )

output of the top:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
COMMAND
21222 diana     20   0 29020  25m 1928 R   95  0.9   5:32.92 ruby

So it looks like memory footprint is small ~ 25MBs. CPU usage is 95% (i have
two processors, so it is completely using one of them).

I did the same thing for tol_life_xml but it took forever to finish. (more
than 3 hours)


For curiosity I created a method next_tree_dummy. All it does is to
reader.read from file until it reaches </phylogeny> element.

tree of life xml (file size: 45.1MB) - Finished in 3.177743 seconds.

mollusca xml (1.5MB) -  Finished in 0.252993 seconds.

metazoa xml (32.3MB) -  Finished in 3.393467 seconds.

I think this shows that libxml is really fast.


I also did profiling with ruby-prof on ncbi mollusca taxonomy file.

Here is partial output:


diana at diana-ubuntu:~/bioruby$ ruby-prof -p graph
test/unit/bio/db/test_phyloxml_big.rb
Loaded suite /usr/bin/ruby-prof
Started
.
Finished in 1345.4039 seconds.

1 tests, 0 assertions, 0 failures, 0 errors
Thread ID: 3084157360
Total Time: 1257.6

[..]
-------------------------------------------------------------------------------
                   1257.56      0.60      0.00   1256.96
2/2     Bio::TestPhyloXMLBig#test_next_tree
 100.00%   0.05%   1257.56      0.60      0.00   1256.96
2     Bio::PhyloXML#next_tree
                      0.08      0.03      0.00      0.05
8107/16210     Bio::PhyloXML#parse_attributes
                      0.00      0.00      0.00      0.00
1/243188     String#==
                      0.00      0.00      0.00      0.00
8104/24322     LibXML::XML::Reader#[]
                      6.18      0.97      0.00      5.21
48616/48616     Bio::PhyloXML#parse_clade_elements
                      0.14      0.14      0.00      0.00
48623/97244     LibXML::XML::Reader#read
                      0.27      0.17      0.00      0.10
48644/875134     Bio::PhyloXML#is_element?
                      0.11      0.04      0.00      0.07
16206/32442     Class#new
                      0.04      0.04      0.00      0.00
16206/116034     Kernel#==
                      0.00      0.00      0.00      0.00
4/72929     Bio::PhyloXML#parse_simple_elements
                      0.07      0.01      0.00      0.06
8102/8102     Bio::Tree#add_node
                      0.44      0.31      0.00      0.13
97243/137758     Bio::PhyloXML#is_end_element?
                   1249.05      0.02      0.00   1249.03
8102/8102     Bio::Tree#parent
                      0.58      0.07      0.00      0.51
8102/8102     Bio::Tree#add_edge
-----------------------------------------------------------------------------
                   1249.05      0.02      0.00   1249.03
8102/8102     Bio::PhyloXML#next_tree
  99.32%   0.00%   1249.05      0.02      0.00   1249.03
8102     Bio::Tree#parent
                   1249.03      0.13      0.00   1248.90
8102/8102     Bio::Tree#path
                      0.00      0.00      0.00      0.00
8102/72975     Array#[]
--------------------------------------------------------------------------------
                   1249.03      0.13      0.00   1248.90
8102/8102     Bio::Tree#parent
  99.32%   0.01%   1249.03      0.13      0.00   1248.90
8102     Bio::Tree#path
                      0.04      0.01      0.00      0.03
16204/164638052     Hash#[]
                   1248.82      0.27      0.00   1248.55
8102/8102     Bio::Pathway#bfs_shortest_path
                      0.03      0.03      0.00      0.00
24306/116034     Kernel#==
                      0.01      0.01      0.00      0.00
16204/72975     Array#[]
--------------------------------------------------------------------------------
                   1248.82      0.27      0.00   1248.55
8102/8102     Bio::Tree#path
  99.30%   0.02%   1248.82      0.27      0.00   1248.55
8102     Bio::Pathway#bfs_shortest_path
                      0.26      0.19      0.00      0.07
142736/164638052     Hash#[]
                      0.07      0.07      0.00      0.00
75419/116034     Kernel#==
                   1248.18    115.50      0.00   1132.68
8102/8102     Bio::Pathway#breadth_first_search
                      0.04      0.04      0.00      0.00
67317/67330     Array#unshift
--------------------------------------------------------------------------------
                   1248.18    115.50      0.00   1132.68
8102/8102     Bio::Pathway#bfs_shortest_path
  99.25%   9.18%   1248.18    115.50      0.00   1132.68
8102     Bio::Pathway#breadth_first_search
                    136.52     92.65      0.00
43.8765785140/164638052     Hash#[]
                     22.53     22.53      0.00
0.0032900672/32900681     Array#shift
                    973.59    324.56      0.00
649.0332892570/32892570     Hash#each_key
                      0.04      0.03      0.00      0.01
24306/98702064     Hash#[]=
[..]


99.32% of the total time is spent in Bio::Tree#parent method and the methods
it calls.
Bio::Tree#parent calls Bio::Tree#path which calls
Bio::Pathways#bfs_shortest_path which in turn calls
Bio::Pathway#breadth_first_search (99.25% of total time is spent in this
method and its sub calls).

This was a huge surprise for me. Why would breadth first search be needed if
I just want to know the parent node of the current node.

The reason I am using Bio::Tree#parent is because I have to keep track of
the current node I am parsing. When I have reached </clade> element i set
the current_node to the parent of the node I just parsed.

I see here two options.

1) Keep track of the current node myself (by putting references in an array
and pushing and poping accordingly). Thus I won't have to call the
Bio::Tree#parent method.

2) Update Bio::Tree/ Bio::Node class so that nodes contain references to
their parents. (thus not needing to  call breadth first search).

What do you think?

Diana



More information about the BioRuby mailing list