[Bioperl-l] SearchIO speed up

Sendu Bala bix at sendu.me.uk
Thu Aug 10 15:19:52 UTC 2006


I am aiming to solve Project priority list item 1.2.1 "Improve 
Bio::SearchIO speed...".

I have made changes that result in a certain speed improvement (up to 
5x) in parsing BLAST results (and most/all? other SearchIO parsers). The 
changes made were as minimal as possible. I did not touch the parsing 
code itself, did not make any API changes except for gain-of-function, 
and did not make any behaviour changes (except for speed!).
For this reason I don't think there would be any contention about the 
changes so I will just commit them...

...Except I need to know if the community considers the speed problem 
solved or not. More radical changes will make SearchIO even faster, eg. 
Chris Fields and Jason (if I interpret the Project priority list item 
correctly) have suggested an end to individual Hit and HSP objects, 
which become just data members of a Result-like object. Ideally I don't 
want to go down that route because we lose quite a bit of OO power; HSP 
objects in particular make important use of inheritance and we'd end up 
stuffing tens of duplicated-code methods into the Result object 
replacement. Ugh. Or having the Result-like object implement 4 different 
interfaces. Ugh. That said, judging by the results below, having hsps as 
pure hashes would result in a ~5x speedup all the time instead of only a 
1.5x speedup in worst case.

So can people do some of their own speed tests on typical, realistic 
blast-parsing jobs before and after I commit the changes? Then everyone 
can decide if more needs to be done. I don't think everyone should do 
the same test - more variety is good. Obviously each individual should 
do the same test before and after. Post your before tests here and I'll 
commit sometime this weekend. Thanks for your help!


My own speed tests
------------------
#Files used:
   'medium.blast' = 670k plain-text format blastn with 1 result
   'large.blast'  = 8.2MB plain-text format blastn with 5 results

# Best case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => 
"large.blast"); while ($result = $sio->next_result) { }'

Before:
         repeat 1    repeat 2   repeat 3   average
real    1m28.065s   1m28.305s  1m28.271s  1m28.214s
user    1m27.610s   1m27.850s  1m27.860s  1m27.773s
sys     0m0.440s    0m0.450s   0m0.410s   0m0.433s

After:
         repeat 1    repeat 2   repeat 3   average    speed up:
real    0m16.653s   0m16.711s  0m16.685s  0m16.683s  5.3x
user    0m16.510s   0m16.570s  0m16.580s  0m16.553s  5.3x
sys     0m0.130s    0m0.140s   0m0.110s   0m0.127s   3.4x


perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => 
"medium.blast"); while ($result = $sio->next_result) { }'

dprofpp -I before:
Total Elapsed Time = 5.309818 Seconds
   User+System Time = 5.209818 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  131.   0.473  6.834      2   0.2365 3.4169 
Bio::SearchIO::blast::next_result
  110.   0.333  5.753  36512   0.0000 0.0002 
Bio::SearchIO::blast::end_element
  102.   0.315  5.326   2388   0.0001 0.0022 
Bio::SearchIO::SearchResultEventBuilder::end_hsp
  92.2   0.026  4.804   7266   0.0000 0.0007 
Bio::Factory::ObjectFactory::create_object
  91.4   0.250  4.765   2388   0.0001 0.0020 
Bio::Search::HSP::GenericHSP::new
  40.4   0.162  2.105   2388   0.0001 0.0009 
Bio::SeqFeature::SimilarityPair::new
  39.0   0.173  2.034   9552   0.0000 0.0002 
Bio::SeqFeature::Similarity::new

dprofpp -I after:
Total Elapsed Time = 1.090717 Seconds
   User+System Time = 1.490717 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  98.2   0.333  1.465      2   0.1665 0.7324 
Bio::SearchIO::blast::next_result
  40.5   0.213  0.604  36512   0.0000 0.0000 
Bio::SearchIO::blast::end_element
  22.8       -  0.341  34022        - 0.0000  Bio::SearchIO::blast::element
  19.8   0.296  0.296   2388   0.0001 0.0001 
Bio::SearchIO::SearchResultEventBuilder::end_hsp
  15.7   0.156  0.235  41186   0.0000 0.0000 
Bio::SearchIO::blast::characters
  12.6   0.188  0.188 141495   0.0000 0.0000 
Bio::SearchIO::blast::in_element
  10.3   0.133  0.154  25795   0.0000 0.0000  Bio::Root::IO::_readline

Improvement: most of our time now spent in the SAX-like bits of the 
parser code, instead of creating HSPI objects. (I might streamline the 
SAX-like bits as well.)


# Medium case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => 
"large.blast"); while ($result = $sio->next_result) { while ($hit = 
$result->next_hit) { while ($hsp = $hit->next_hsp) { } } }'

Before:
         repeat 1    repeat 2   repeat 3   average
real    1m27.771s   1m27.996s  1m28.953s  1m28.240s
user    1m27.200s   1m27.540s  1m28.500s  1m27.747s
sys     0m0.560s    0m0.450s   0m0.450s   0m0.487s

After:
         repeat 1    repeat 2   repeat 3   average    speed up:
real    0m25.984s   0m25.996s  0m25.714s  0m25.898s  3.4x
user    0m25.780s   0m25.750s  0m25.510s  0m25.680s  3.4x
sys     0m0.210s    0m0.240s   0m0.200s   0m0.217s   2.2x


# Worse case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => 
"large.blast"); while ($result = $sio->next_result) { while ($hit = 
$result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query; 
$hsp->hit; } } }'

Before:
         repeat 1    repeat 2   repeat 3   average
real    1m28.104s   1m27.861s  1m27.970s  1m27.978s
user    1m27.600s   1m27.420s  1m27.420s  1m27.480s
sys     0m0.490s    0m0.430s   0m0.550s   0m0.490s

After:
         repeat 1    repeat 2   repeat 3   average    speed up:
real    0m57.167s   0m57.080s  0m56.903s  0m58.050s  1.5x
user    0m56.700s   0m56.770s  0m56.530s  0m56.667s  1.5x
sys     0m0.470s    0m0.310s   0m0.370s   0m0.383s   1.3x


perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => 
"medium.blast"); while ($result = $sio->next_result) { while ($hit = 
$result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query; 
$hsp->hit; } } }'

dprofpp -I before:
Total Elapsed Time = 9.767702 Seconds
   User+System Time = 9.577702 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  115.   0.568 11.031      2   0.2838 5.5156 
Bio::SearchIO::blast::next_result

dprofpp -I after:
Total Elapsed Time = 6.521398 Seconds
   User+System Time = 6.571398 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
  61.2   0.087  4.023  11940   0.0000 0.0003 
Bio::Search::HSP::GenericHSP::query

Improvement: now we spend most of our time dealing with the thing we 
actually wanted from the parse, not dealing with the parsing process itself.



--------------------------------------------------------

Here are the changes involved in achieving the speed up:


Bio::Search::HSP::GenericHSP
----------------------------

# Implementation changes
Call to new() now calls no methods of its own; no work is done simply to 
create a GenericHSP. The code that was previously in new() has been 
moved to private methods that are called just-in-time, as the user 
desires to know certain information by manually calling HSPI methods.

Added new options to new() -hit_desc and -query_desc for setting the 
description text for the sequences.


Bio::Search::Hit::GenericHit
----------------------------

# API-CHANGES
new() has extra option -hsp_factory.

New method hsp_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hsp() can now accept a hash ref instead of just a HSPI.

# Implementation changes
next_hsp() and hsps() convert hash ref hsp data to HSPI objects using 
the hsp_factory() as necessary.

# Notes
num_hsps() claimed to throw if there were no HSPs, but returns '-'. 
Updated docs.


Bio::Search::Result::GenericResult
----------------------------------

# API-CHANGES
new() has extra option -hit_factory.

New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hit() can now accept a hash ref instead of just a HitI.

# Implementation changes
next_hit() and hits() convert hash ref hit data to HitI objects using 
the hit_factory() as necessary.


Bio::Search::Iteration::GenericIteration
----------------------------------------

# API-CHANGES
new() has extra option -hit_factory.

New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hit() can now accept a hash ref instead of just a HitI.

# Implementation changes
Various methods convert hash ref hit data to HitI objects using the 
hit_factory() as necessary.


Bio::SearchIO::SearchResultEventBuilder
---------------------------------------

# Implementation changes
Methods end_hsp() and end_hit() return hash refs containing data 
suitable for creating HitI and HSPI objects respectively.


Bio::SearchIO::IteratedSearchResultEventBuilder
-----------------------------------------------

# Implementation changes
_add_hit() deals with the new way hit information is stored.

end_iteration() supplies the hit factory to created and returned 
iteration factories.


Bio::SeqFeature::SimilarityPair
-------------------------------

# Notes
new() makes an object factory if not supplied -feature_factory as arg, 
but then does nothing with its freshly created factory, and ignores any 
that are supplied with -feature_factory. Now it just doesn't make a 
factory at all, to save time.

new() has an option -feature1, but when supplied this is never used or 
set. I've left this alone.



More information about the Bioperl-l mailing list