[Bioperl-l] SearchIO speed up
Chris Fields
cjfields at uiuc.edu
Thu Aug 10 16:10:57 UTC 2006
Sendu,
You've been busy!
I think that anything that speeds up SearchIO would help
tremendously. We still need to be wary of API issues, but I have no
problem whatsoever adding this in. You might need to give the
community time to digest these changes before committing to HEAD,
though.
Here's a couple of suggestions to get around that if you want to get
the code out there for testing:
Could this be CVS-tagged to an experimental bioperl branch instead?
It could be merged back to the main branch once everybody gets to try
it out, and you could commit changes to the branch (tests, scripts,
etc) along the way based on suggestions. Think of this as a test-
drive for a new Bioperl release.
Alternatively, add the modified modules or patches to Bugzilla for
people to test out. You might want to add a script and test data as
well. These could be updated along the way much like CVS, but
probably a lot noisier.
I agree with the thought of retaining some degree of OO. I still
wonder how much object instantiation really affects speed vs. all
those method calls.
Chris
On Aug 10, 2006, at 10:19 AM, Sendu Bala wrote:
> I am aiming to solve Project priority list item 1.2.1 "Improve
> Bio::SearchIO speed...".
>
> I have made changes that result in a certain speed improvement (up to
> 5x) in parsing BLAST results (and most/all? other SearchIO
> parsers). The
> changes made were as minimal as possible. I did not touch the parsing
> code itself, did not make any API changes except for gain-of-function,
> and did not make any behaviour changes (except for speed!).
> For this reason I don't think there would be any contention about the
> changes so I will just commit them...
>
> ...Except I need to know if the community considers the speed problem
> solved or not. More radical changes will make SearchIO even faster,
> eg.
> Chris Fields and Jason (if I interpret the Project priority list item
> correctly) have suggested an end to individual Hit and HSP objects,
> which become just data members of a Result-like object. Ideally I
> don't
> want to go down that route because we lose quite a bit of OO power;
> HSP
> objects in particular make important use of inheritance and we'd
> end up
> stuffing tens of duplicated-code methods into the Result object
> replacement. Ugh. Or having the Result-like object implement 4
> different
> interfaces. Ugh. That said, judging by the results below, having
> hsps as
> pure hashes would result in a ~5x speedup all the time instead of
> only a
> 1.5x speedup in worst case.
>
> So can people do some of their own speed tests on typical, realistic
> blast-parsing jobs before and after I commit the changes? Then
> everyone
> can decide if more needs to be done. I don't think everyone should do
> the same test - more variety is good. Obviously each individual should
> do the same test before and after. Post your before tests here and
> I'll
> commit sometime this weekend. Thanks for your help!
>
>
> My own speed tests
> ------------------
> #Files used:
> 'medium.blast' = 670k plain-text format blastn with 1 result
> 'large.blast' = 8.2MB plain-text format blastn with 5 results
>
> # Best case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { }'
>
> Before:
> repeat 1 repeat 2 repeat 3 average
> real 1m28.065s 1m28.305s 1m28.271s 1m28.214s
> user 1m27.610s 1m27.850s 1m27.860s 1m27.773s
> sys 0m0.440s 0m0.450s 0m0.410s 0m0.433s
>
> After:
> repeat 1 repeat 2 repeat 3 average speed up:
> real 0m16.653s 0m16.711s 0m16.685s 0m16.683s 5.3x
> user 0m16.510s 0m16.570s 0m16.580s 0m16.553s 5.3x
> sys 0m0.130s 0m0.140s 0m0.110s 0m0.127s 3.4x
>
>
> perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "medium.blast"); while ($result = $sio->next_result) { }'
>
> dprofpp -I before:
> Total Elapsed Time = 5.309818 Seconds
> User+System Time = 5.209818 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 131. 0.473 6.834 2 0.2365 3.4169
> Bio::SearchIO::blast::next_result
> 110. 0.333 5.753 36512 0.0000 0.0002
> Bio::SearchIO::blast::end_element
> 102. 0.315 5.326 2388 0.0001 0.0022
> Bio::SearchIO::SearchResultEventBuilder::end_hsp
> 92.2 0.026 4.804 7266 0.0000 0.0007
> Bio::Factory::ObjectFactory::create_object
> 91.4 0.250 4.765 2388 0.0001 0.0020
> Bio::Search::HSP::GenericHSP::new
> 40.4 0.162 2.105 2388 0.0001 0.0009
> Bio::SeqFeature::SimilarityPair::new
> 39.0 0.173 2.034 9552 0.0000 0.0002
> Bio::SeqFeature::Similarity::new
>
> dprofpp -I after:
> Total Elapsed Time = 1.090717 Seconds
> User+System Time = 1.490717 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 98.2 0.333 1.465 2 0.1665 0.7324
> Bio::SearchIO::blast::next_result
> 40.5 0.213 0.604 36512 0.0000 0.0000
> Bio::SearchIO::blast::end_element
> 22.8 - 0.341 34022 - 0.0000
> Bio::SearchIO::blast::element
> 19.8 0.296 0.296 2388 0.0001 0.0001
> Bio::SearchIO::SearchResultEventBuilder::end_hsp
> 15.7 0.156 0.235 41186 0.0000 0.0000
> Bio::SearchIO::blast::characters
> 12.6 0.188 0.188 141495 0.0000 0.0000
> Bio::SearchIO::blast::in_element
> 10.3 0.133 0.154 25795 0.0000 0.0000 Bio::Root::IO::_readline
>
> Improvement: most of our time now spent in the SAX-like bits of the
> parser code, instead of creating HSPI objects. (I might streamline the
> SAX-like bits as well.)
>
>
> # Medium case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { } } }'
>
> Before:
> repeat 1 repeat 2 repeat 3 average
> real 1m27.771s 1m27.996s 1m28.953s 1m28.240s
> user 1m27.200s 1m27.540s 1m28.500s 1m27.747s
> sys 0m0.560s 0m0.450s 0m0.450s 0m0.487s
>
> After:
> repeat 1 repeat 2 repeat 3 average speed up:
> real 0m25.984s 0m25.996s 0m25.714s 0m25.898s 3.4x
> user 0m25.780s 0m25.750s 0m25.510s 0m25.680s 3.4x
> sys 0m0.210s 0m0.240s 0m0.200s 0m0.217s 2.2x
>
>
> # Worse case scenario:
> time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "large.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
> $hsp->hit; } } }'
>
> Before:
> repeat 1 repeat 2 repeat 3 average
> real 1m28.104s 1m27.861s 1m27.970s 1m27.978s
> user 1m27.600s 1m27.420s 1m27.420s 1m27.480s
> sys 0m0.490s 0m0.430s 0m0.550s 0m0.490s
>
> After:
> repeat 1 repeat 2 repeat 3 average speed up:
> real 0m57.167s 0m57.080s 0m56.903s 0m58.050s 1.5x
> user 0m56.700s 0m56.770s 0m56.530s 0m56.667s 1.5x
> sys 0m0.470s 0m0.310s 0m0.370s 0m0.383s 1.3x
>
>
> perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file =>
> "medium.blast"); while ($result = $sio->next_result) { while ($hit =
> $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query;
> $hsp->hit; } } }'
>
> dprofpp -I before:
> Total Elapsed Time = 9.767702 Seconds
> User+System Time = 9.577702 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 115. 0.568 11.031 2 0.2838 5.5156
> Bio::SearchIO::blast::next_result
>
> dprofpp -I after:
> Total Elapsed Time = 6.521398 Seconds
> User+System Time = 6.571398 Seconds
> Inclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 61.2 0.087 4.023 11940 0.0000 0.0003
> Bio::Search::HSP::GenericHSP::query
>
> Improvement: now we spend most of our time dealing with the thing we
> actually wanted from the parse, not dealing with the parsing
> process itself.
>
>
>
> --------------------------------------------------------
>
> Here are the changes involved in achieving the speed up:
>
>
> Bio::Search::HSP::GenericHSP
> ----------------------------
>
> # Implementation changes
> Call to new() now calls no methods of its own; no work is done
> simply to
> create a GenericHSP. The code that was previously in new() has been
> moved to private methods that are called just-in-time, as the user
> desires to know certain information by manually calling HSPI methods.
>
> Added new options to new() -hit_desc and -query_desc for setting the
> description text for the sequences.
>
>
> Bio::Search::Hit::GenericHit
> ----------------------------
>
> # API-CHANGES
> new() has extra option -hsp_factory.
>
> New method hsp_factory() which gets/sets a
> Bio::Factory::ObjectFactoryI.
>
> add_hsp() can now accept a hash ref instead of just a HSPI.
>
> # Implementation changes
> next_hsp() and hsps() convert hash ref hsp data to HSPI objects using
> the hsp_factory() as necessary.
>
> # Notes
> num_hsps() claimed to throw if there were no HSPs, but returns '-'.
> Updated docs.
>
>
> Bio::Search::Result::GenericResult
> ----------------------------------
>
> # API-CHANGES
> new() has extra option -hit_factory.
>
> New method hit_factory() which gets/sets a
> Bio::Factory::ObjectFactoryI.
>
> add_hit() can now accept a hash ref instead of just a HitI.
>
> # Implementation changes
> next_hit() and hits() convert hash ref hit data to HitI objects using
> the hit_factory() as necessary.
>
>
> Bio::Search::Iteration::GenericIteration
> ----------------------------------------
>
> # API-CHANGES
> new() has extra option -hit_factory.
>
> New method hit_factory() which gets/sets a
> Bio::Factory::ObjectFactoryI.
>
> add_hit() can now accept a hash ref instead of just a HitI.
>
> # Implementation changes
> Various methods convert hash ref hit data to HitI objects using the
> hit_factory() as necessary.
>
>
> Bio::SearchIO::SearchResultEventBuilder
> ---------------------------------------
>
> # Implementation changes
> Methods end_hsp() and end_hit() return hash refs containing data
> suitable for creating HitI and HSPI objects respectively.
>
>
> Bio::SearchIO::IteratedSearchResultEventBuilder
> -----------------------------------------------
>
> # Implementation changes
> _add_hit() deals with the new way hit information is stored.
>
> end_iteration() supplies the hit factory to created and returned
> iteration factories.
>
>
> Bio::SeqFeature::SimilarityPair
> -------------------------------
>
> # Notes
> new() makes an object factory if not supplied -feature_factory as arg,
> but then does nothing with its freshly created factory, and ignores
> any
> that are supplied with -feature_factory. Now it just doesn't make a
> factory at all, to save time.
>
> new() has an option -feature1, but when supplied this is never used or
> set. I've left this alone.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign
More information about the Bioperl-l
mailing list