[Bioperl-l] Bio::PrimarySeq speedup

Fields, Christopher J cjfields at illinois.edu
Mon Nov 5 02:43:28 UTC 2012


Florent,

Ran tests on it, they pass but I am seeing this (if these are expected, you can catch the warnings using Test::Warn):

[cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr t/Seq/PrimarySeq.t 
t/Seq/PrimarySeq.t .. 1/167 
--------------------- WARNING ---------------------
MSG: Got a sequence without letters. Could not guess alphabet
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is \,$,+
---------------------------------------------------

--------------------- WARNING ---------------------
MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
---------------------------------------------------
t/Seq/PrimarySeq.t .. ok       
All tests successful.
Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.18 cusr  0.01 csys =  0.23 CPU)
Result: PASS

chris

On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
 wrote:

> I am planning on merging the branch with master this week.
> Best,
> Florent
> 
> 
> On 01/11/12 15:49, Florent Angly wrote:
>> Hi all,
>> 
>> I was working with Ben Woodcroft on identifying ways to speed up Grinder, which relies heavily on Bioperl. Ben did some profiling with NYTProf and we realized that a lot of computation time was spent in Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we used for the profiling were microbial genomes, i.e. several Mbp long sequences, which is quite long. A lot of the performance cost was associated with passing full genomes between functions. For example, when doing a call to length(), length() requests the full sequence from seq(), which returns it back to length() (it makes a copy!). So, every call to length is very expensive for long sequences. And there is a lot of code that calls length(), for error checking.
>> 
>> I know that there are a few Bioperl modules that are more adapted to handling very long sequences, e.g. Bio::DB::Fasta or Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at Bio::PrimarySeq with Ben and I released this commit: https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. But in fact, there were more things that I wanted to try to improve, which led me to start this new branch: https://github.com/bioperl/bioperl-live/tree/seqlength
>> 
>> I wrote quite a few tests for functionalities that were not previously covered by tests, and tried to improve the documentation. In addition, to address the speed issue, I did some changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>> • The length of a sequence is now computed as soon as the sequence is set, not after. This way, there is no extra call to seq() (which would incur the cost of copying the entire sequence between functions).
>> • The length is saved as an object attribute. So, calling length() is very cheap since it only needs to retrieve the stored value for the length.
>> • There is a constructor called -direct, which skips sequence validation. However, it was only active in conjunction with the -ref_to_seq constructor. To make -direct conform better to its documented purpose, I made it -direct work when a sequence is set through -seq as well.
>> • This brings us to trunc(), revcom() and other methods of Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq object from an existing (already validated!) Bio::PrimarySeq object, the new object can be constructed with the -direct constructor, to save some time.
>> • Finally, I noticed that subseq() used calls to eval() to do its work. eval() is notoriously slow and these calls were easily replaced by simple calls to substr() to save some time.
>> 
>> A real-world test I performed with Grinder took 3m28s before the changes (and ~1 min is spent doing something unrelated). After the changes, the same test took only 2min28s. So, it's quite a significant improvement and on more specific test cases, performance gains can obviously be much bigger. Also, I anticipate that the gains would be bigger for even longer sequences.
>> 
>> All the changes I made are meant to be backward compatible and all the tests in the Bioperl test suite passed. So, there _should_ not be any issues. However, I know that Bio::PrimarySeq is a central module of Bioperl, so please, have a look at it and let me know if there are any glaring errors.
>> 
>> Thanks,
>> 
>> Florent
>> 
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l





More information about the Bioperl-l mailing list