[Bioperl-l] SearchIO speed up
Sendu Bala
bix at sendu.me.uk
Thu Aug 10 15:19:52 UTC 2006
I am aiming to solve Project priority list item 1.2.1 "Improve
Bio::SearchIO speed...".
I have made changes that result in a certain speed improvement (up to
5x) in parsing BLAST results (and most/all? other SearchIO parsers). The
changes made were as minimal as possible. I did not touch the parsing
code itself, did not make any API changes except for gain-of-function,
and did not make any behaviour changes (except for speed!).
For this reason I don't think there would be any contention about the
changes so I will just commit them...
...Except I need to know if the community considers the speed problem
solved or not. More radical changes will make SearchIO even faster, eg.
Chris Fields and Jason (if I interpret the Project priority list item
correctly) have suggested an end to individual Hit and HSP objects,
which become just data members of a Result-like object. Ideally I don't
want to go down that route because we lose quite a bit of OO power; HSP
objects in particular make important use of inheritance and we'd end up
stuffing tens of duplicated-code methods into the Result object
replacement. Ugh. Or having the Result-like object implement 4 different
interfaces. Ugh. That said, judging by the results below, having hsps as
pure hashes would result in a ~5x speedup all the time instead of only a
1.5x speedup in worst case.
So can people do some of their own speed tests on typical, realistic
blast-parsing jobs before and after I commit the changes? Then everyone
can decide if more needs to be done. I don't think everyone should do
the same test - more variety is good. Obviously each individual should
do the same test before and after. Post your before tests here and I'll
commit sometime this weekend. Thanks for your help!
My own speed tests
------------------
#Files used:
'medium.blast' = 670k plain-text format blastn with 1 result
'large.blast' = 8.2MB plain-text format blastn with 5 results
# Best case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
"large.blast"); while ($result = $sio->next_result) { }'
Before:
repeat 1 repeat 2 repeat 3 average
real 1m28.065s 1m28.305s 1m28.271s 1m28.214s
user 1m27.610s 1m27.850s 1m27.860s 1m27.773s
sys 0m0.440s 0m0.450s 0m0.410s 0m0.433s
After:
repeat 1 repeat 2 repeat 3 average speed up:
real 0m16.653s 0m16.711s 0m16.685s 0m16.683s 5.3x
user 0m16.510s 0m16.570s 0m16.580s 0m16.553s 5.3x
sys 0m0.130s 0m0.140s 0m0.110s 0m0.127s 3.4x
perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
"medium.blast"); while ($result = $sio->next_result) { }'
dprofpp -I before:
Total Elapsed Time = 5.309818 Seconds
User+System Time = 5.209818 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
131. 0.473 6.834 2 0.2365 3.4169
Bio::SearchIO::blast::next_result
110. 0.333 5.753 36512 0.0000 0.0002
Bio::SearchIO::blast::end_element
102. 0.315 5.326 2388 0.0001 0.0022
Bio::SearchIO::SearchResultEventBuilder::end_hsp
92.2 0.026 4.804 7266 0.0000 0.0007
Bio::Factory::ObjectFactory::create_object
91.4 0.250 4.765 2388 0.0001 0.0020
Bio::Search::HSP::GenericHSP::new
40.4 0.162 2.105 2388 0.0001 0.0009
Bio::SeqFeature::SimilarityPair::new
39.0 0.173 2.034 9552 0.0000 0.0002
Bio::SeqFeature::Similarity::new
dprofpp -I after:
Total Elapsed Time = 1.090717 Seconds
User+System Time = 1.490717 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
98.2 0.333 1.465 2 0.1665 0.7324
Bio::SearchIO::blast::next_result
40.5 0.213 0.604 36512 0.0000 0.0000
Bio::SearchIO::blast::end_element
22.8 - 0.341 34022 - 0.0000 Bio::SearchIO::blast::element
19.8 0.296 0.296 2388 0.0001 0.0001
Bio::SearchIO::SearchResultEventBuilder::end_hsp
15.7 0.156 0.235 41186 0.0000 0.0000
Bio::SearchIO::blast::characters
12.6 0.188 0.188 141495 0.0000 0.0000
Bio::SearchIO::blast::in_element
10.3 0.133 0.154 25795 0.0000 0.0000 Bio::Root::IO::_readline
Improvement: most of our time now spent in the SAX-like bits of the
parser code, instead of creating HSPI objects. (I might streamline the
SAX-like bits as well.)
# Medium case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
"large.blast"); while ($result = $sio->next_result) { while ($hit =
$result->next_hit) { while ($hsp = $hit->next_hsp) { } } }'
Before:
repeat 1 repeat 2 repeat 3 average
real 1m27.771s 1m27.996s 1m28.953s 1m28.240s
user 1m27.200s 1m27.540s 1m28.500s 1m27.747s
sys 0m0.560s 0m0.450s 0m0.450s 0m0.487s
After:
repeat 1 repeat 2 repeat 3 average speed up:
real 0m25.984s 0m25.996s 0m25.714s 0m25.898s 3.4x
user 0m25.780s 0m25.750s 0m25.510s 0m25.680s 3.4x
sys 0m0.210s 0m0.240s 0m0.200s 0m0.217s 2.2x
# Worse case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
"large.blast"); while ($result = $sio->next_result) { while ($hit =
$result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
$hsp->hit; } } }'
Before:
repeat 1 repeat 2 repeat 3 average
real 1m28.104s 1m27.861s 1m27.970s 1m27.978s
user 1m27.600s 1m27.420s 1m27.420s 1m27.480s
sys 0m0.490s 0m0.430s 0m0.550s 0m0.490s
After:
repeat 1 repeat 2 repeat 3 average speed up:
real 0m57.167s 0m57.080s 0m56.903s 0m58.050s 1.5x
user 0m56.700s 0m56.770s 0m56.530s 0m56.667s 1.5x
sys 0m0.470s 0m0.310s 0m0.370s 0m0.383s 1.3x
perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
"medium.blast"); while ($result = $sio->next_result) { while ($hit =
$result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
$hsp->hit; } } }'
dprofpp -I before:
Total Elapsed Time = 9.767702 Seconds
User+System Time = 9.577702 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
115. 0.568 11.031 2 0.2838 5.5156
Bio::SearchIO::blast::next_result
dprofpp -I after:
Total Elapsed Time = 6.521398 Seconds
User+System Time = 6.571398 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
61.2 0.087 4.023 11940 0.0000 0.0003
Bio::Search::HSP::GenericHSP::query
Improvement: now we spend most of our time dealing with the thing we
actually wanted from the parse, not dealing with the parsing process itself.
--------------------------------------------------------
Here are the changes involved in achieving the speed up:
Bio::Search::HSP::GenericHSP
----------------------------
# Implementation changes
Call to new() now calls no methods of its own; no work is done simply to
create a GenericHSP. The code that was previously in new() has been
moved to private methods that are called just-in-time, as the user
desires to know certain information by manually calling HSPI methods.
Added new options to new() -hit_desc and -query_desc for setting the
description text for the sequences.
Bio::Search::Hit::GenericHit
----------------------------
# API-CHANGES
new() has extra option -hsp_factory.
New method hsp_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
add_hsp() can now accept a hash ref instead of just a HSPI.
# Implementation changes
next_hsp() and hsps() convert hash ref hsp data to HSPI objects using
the hsp_factory() as necessary.
# Notes
num_hsps() claimed to throw if there were no HSPs, but returns '-'.
Updated docs.
Bio::Search::Result::GenericResult
----------------------------------
# API-CHANGES
new() has extra option -hit_factory.
New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
add_hit() can now accept a hash ref instead of just a HitI.
# Implementation changes
next_hit() and hits() convert hash ref hit data to HitI objects using
the hit_factory() as necessary.
Bio::Search::Iteration::GenericIteration
----------------------------------------
# API-CHANGES
new() has extra option -hit_factory.
New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.
add_hit() can now accept a hash ref instead of just a HitI.
# Implementation changes
Various methods convert hash ref hit data to HitI objects using the
hit_factory() as necessary.
Bio::SearchIO::SearchResultEventBuilder
---------------------------------------
# Implementation changes
Methods end_hsp() and end_hit() return hash refs containing data
suitable for creating HitI and HSPI objects respectively.
Bio::SearchIO::IteratedSearchResultEventBuilder
-----------------------------------------------
# Implementation changes
_add_hit() deals with the new way hit information is stored.
end_iteration() supplies the hit factory to created and returned
iteration factories.
Bio::SeqFeature::SimilarityPair
-------------------------------
# Notes
new() makes an object factory if not supplied -feature_factory as arg,
but then does nothing with its freshly created factory, and ignores any
that are supplied with -feature_factory. Now it just doesn't make a
factory at all, to save time.
new() has an option -feature1, but when supplied this is never used or
set. I've left this alone.
More information about the Bioperl-l
mailing list