[Bioperl-l] Bio::PrimarySeq speedup

Mon Nov 5 00:46:44 UTC 2012

I am planning on merging the branch with master this week.
Best,
Florent

On 01/11/12 15:49, Florent Angly wrote:
> Hi all,
>
> I was working with Ben Woodcroft on identifying ways to speed up 
> Grinder, which relies heavily on Bioperl. Ben did some profiling with 
> NYTProf and we realized that a lot of computation time was spent in 
> Bio::PrimarySeq, doing calls to subseq() and length(). The sequences 
> we used for the profiling were microbial genomes, i.e. several Mbp 
> long sequences, which is quite long. A lot of the performance cost was 
> associated with passing full genomes between functions. For example, 
> when doing a call to length(), length() requests the full sequence 
> from seq(), which returns it back to length() (it makes a copy!). So, 
> every call to length is very expensive for long sequences. And there 
> is a lot of code that calls length(), for error checking.
>
> I know that there are a few Bioperl modules that are more adapted to 
> handling very long sequences, e.g. Bio::DB::Fasta or 
> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
> Bio::PrimarySeq with Ben and I released this commit: 
> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
> But in fact, there were more things that I wanted to try to improve, 
> which led me to start this new branch: 
> https://github.com/bioperl/bioperl-live/tree/seqlength
>
> I wrote quite a few tests for functionalities that were not previously 
> covered by tests, and tried to improve the documentation. In addition, 
> to address the speed issue, I did some changes to Bio::PrimarySeq and 
> Bio::PrimarySeqI :
> • The length of a sequence is now computed as soon as the sequence is 
> set, not after. This way, there is no extra call to seq() (which would 
> incur the cost of copying the entire sequence between functions).
> • The length is saved as an object attribute. So, calling length() is 
> very cheap since it only needs to retrieve the stored value for the 
> length.
> • There is a constructor called -direct, which skips sequence 
> validation. However, it was only active in conjunction with the 
> -ref_to_seq constructor. To make -direct conform better to its 
> documented purpose, I made it -direct work when a sequence is set 
> through -seq as well.
> • This brings us to trunc(), revcom() and other methods of 
> Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq 
> object from an existing (already validated!) Bio::PrimarySeq object, 
> the new object can be constructed with the -direct constructor, to 
> save some time.
> • Finally, I noticed that subseq() used calls to eval() to do its 
> work. eval() is notoriously slow and these calls were easily replaced 
> by simple calls to substr() to save some time.
>
> A real-world test I performed with Grinder took 3m28s before the 
> changes (and ~1 min is spent doing something unrelated). After the 
> changes, the same test took only 2min28s. So, it's quite a significant 
> improvement and on more specific test cases, performance gains can 
> obviously be much bigger. Also, I anticipate that the gains would be 
> bigger for even longer sequences.
>
> All the changes I made are meant to be backward compatible and all the 
> tests in the Bioperl test suite passed. So, there _should_ not be any 
> issues. However, I know that Bio::PrimarySeq is a central module of 
> Bioperl, so please, have a look at it and let me know if there are any 
> glaring errors.
>
> Thanks,
>
> Florent
>