[Bioperl-l] Bio::PrimarySeq speedup
Florent Angly
florent.angly at gmail.com
Mon Nov 5 00:46:44 UTC 2012
I am planning on merging the branch with master this week.
Best,
Florent
On 01/11/12 15:49, Florent Angly wrote:
> Hi all,
>
> I was working with Ben Woodcroft on identifying ways to speed up
> Grinder, which relies heavily on Bioperl. Ben did some profiling with
> NYTProf and we realized that a lot of computation time was spent in
> Bio::PrimarySeq, doing calls to subseq() and length(). The sequences
> we used for the profiling were microbial genomes, i.e. several Mbp
> long sequences, which is quite long. A lot of the performance cost was
> associated with passing full genomes between functions. For example,
> when doing a call to length(), length() requests the full sequence
> from seq(), which returns it back to length() (it makes a copy!). So,
> every call to length is very expensive for long sequences. And there
> is a lot of code that calls length(), for error checking.
>
> I know that there are a few Bioperl modules that are more adapted to
> handling very long sequences, e.g. Bio::DB::Fasta or
> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at
> Bio::PrimarySeq with Ben and I released this commit:
> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
> But in fact, there were more things that I wanted to try to improve,
> which led me to start this new branch:
> https://github.com/bioperl/bioperl-live/tree/seqlength
>
> I wrote quite a few tests for functionalities that were not previously
> covered by tests, and tried to improve the documentation. In addition,
> to address the speed issue, I did some changes to Bio::PrimarySeq and
> Bio::PrimarySeqI :
> • The length of a sequence is now computed as soon as the sequence is
> set, not after. This way, there is no extra call to seq() (which would
> incur the cost of copying the entire sequence between functions).
> • The length is saved as an object attribute. So, calling length() is
> very cheap since it only needs to retrieve the stored value for the
> length.
> • There is a constructor called -direct, which skips sequence
> validation. However, it was only active in conjunction with the
> -ref_to_seq constructor. To make -direct conform better to its
> documented purpose, I made it -direct work when a sequence is set
> through -seq as well.
> • This brings us to trunc(), revcom() and other methods of
> Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
> object from an existing (already validated!) Bio::PrimarySeq object,
> the new object can be constructed with the -direct constructor, to
> save some time.
> • Finally, I noticed that subseq() used calls to eval() to do its
> work. eval() is notoriously slow and these calls were easily replaced
> by simple calls to substr() to save some time.
>
> A real-world test I performed with Grinder took 3m28s before the
> changes (and ~1 min is spent doing something unrelated). After the
> changes, the same test took only 2min28s. So, it's quite a significant
> improvement and on more specific test cases, performance gains can
> obviously be much bigger. Also, I anticipate that the gains would be
> bigger for even longer sequences.
>
> All the changes I made are meant to be backward compatible and all the
> tests in the Bioperl test suite passed. So, there _should_ not be any
> issues. However, I know that Bio::PrimarySeq is a central module of
> Bioperl, so please, have a look at it and let me know if there are any
> glaring errors.
>
> Thanks,
>
> Florent
>
More information about the Bioperl-l
mailing list