[Bioperl-l] Bio::PrimarySeq speedup

Thu Nov 1 05:49:13 UTC 2012

Hi all,

I was working with Ben Woodcroft on identifying ways to speed up 
Grinder, which relies heavily on Bioperl. Ben did some profiling with 
NYTProf and we realized that a lot of computation time was spent in 
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we 
used for the profiling were microbial genomes, i.e. several Mbp long 
sequences, which is quite long. A lot of the performance cost was 
associated with passing full genomes between functions. For example, 
when doing a call to length(), length() requests the full sequence from 
seq(), which returns it back to length() (it makes a copy!). So, every 
call to length is very expensive for long sequences. And there is a lot 
of code that calls length(), for error checking.

I know that there are a few Bioperl modules that are more adapted to 
handling very long sequences, e.g. Bio::DB::Fasta or 
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at 
Bio::PrimarySeq with Ben and I released this commit: 
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
But in fact, there were more things that I wanted to try to improve, 
which led me to start this new branch: 
https://github.com/bioperl/bioperl-live/tree/seqlength

I wrote quite a few tests for functionalities that were not previously 
covered by tests, and tried to improve the documentation. In addition, 
to address the speed issue, I did some changes to Bio::PrimarySeq and 
Bio::PrimarySeqI :
• The length of a sequence is now computed as soon as the sequence is 
set, not after. This way, there is no extra call to seq() (which would 
incur the cost of copying the entire sequence between functions).
• The length is saved as an object attribute. So, calling length() is 
very cheap since it only needs to retrieve the stored value for the length.
• There is a constructor called -direct, which skips sequence 
validation. However, it was only active in conjunction with the 
-ref_to_seq constructor. To make -direct conform better to its 
documented purpose, I made it -direct work when a sequence is set 
through -seq as well.
• This brings us to trunc(), revcom() and other methods of 
Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq 
object from an existing (already validated!) Bio::PrimarySeq object, the 
new object can be constructed with the -direct constructor, to save some 
time.
• Finally, I noticed that subseq() used calls to eval() to do its work. 
eval() is notoriously slow and these calls were easily replaced by 
simple calls to substr() to save some time.

A real-world test I performed with Grinder took 3m28s before the changes 
(and ~1 min is spent doing something unrelated). After the changes, the 
same test took only 2min28s. So, it's quite a significant improvement 
and on more specific test cases, performance gains can obviously be much 
bigger. Also, I anticipate that the gains would be bigger for even 
longer sequences.

All the changes I made are meant to be backward compatible and all the 
tests in the Bioperl test suite passed. So, there _should_ not be any 
issues. However, I know that Bio::PrimarySeq is a central module of 
Bioperl, so please, have a look at it and let me know if there are any 
glaring errors.

Thanks,

Florent