[Bioperl-l] Bio::PrimarySeq speedup
Florent Angly
florent.angly at gmail.com
Thu Nov 1 05:49:13 UTC 2012
Hi all,
I was working with Ben Woodcroft on identifying ways to speed up
Grinder, which relies heavily on Bioperl. Ben did some profiling with
NYTProf and we realized that a lot of computation time was spent in
Bio::PrimarySeq, doing calls to subseq() and length(). The sequences we
used for the profiling were microbial genomes, i.e. several Mbp long
sequences, which is quite long. A lot of the performance cost was
associated with passing full genomes between functions. For example,
when doing a call to length(), length() requests the full sequence from
seq(), which returns it back to length() (it makes a copy!). So, every
call to length is very expensive for long sequences. And there is a lot
of code that calls length(), for error checking.
I know that there are a few Bioperl modules that are more adapted to
handling very long sequences, e.g. Bio::DB::Fasta or
Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look at
Bio::PrimarySeq with Ben and I released this commit:
https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
But in fact, there were more things that I wanted to try to improve,
which led me to start this new branch:
https://github.com/bioperl/bioperl-live/tree/seqlength
I wrote quite a few tests for functionalities that were not previously
covered by tests, and tried to improve the documentation. In addition,
to address the speed issue, I did some changes to Bio::PrimarySeq and
Bio::PrimarySeqI :
• The length of a sequence is now computed as soon as the sequence is
set, not after. This way, there is no extra call to seq() (which would
incur the cost of copying the entire sequence between functions).
• The length is saved as an object attribute. So, calling length() is
very cheap since it only needs to retrieve the stored value for the length.
• There is a constructor called -direct, which skips sequence
validation. However, it was only active in conjunction with the
-ref_to_seq constructor. To make -direct conform better to its
documented purpose, I made it -direct work when a sequence is set
through -seq as well.
• This brings us to trunc(), revcom() and other methods of
Bio::PrimarySeqI. Since all these methods create a new Bio::PrimarySeq
object from an existing (already validated!) Bio::PrimarySeq object, the
new object can be constructed with the -direct constructor, to save some
time.
• Finally, I noticed that subseq() used calls to eval() to do its work.
eval() is notoriously slow and these calls were easily replaced by
simple calls to substr() to save some time.
A real-world test I performed with Grinder took 3m28s before the changes
(and ~1 min is spent doing something unrelated). After the changes, the
same test took only 2min28s. So, it's quite a significant improvement
and on more specific test cases, performance gains can obviously be much
bigger. Also, I anticipate that the gains would be bigger for even
longer sequences.
All the changes I made are meant to be backward compatible and all the
tests in the Bioperl test suite passed. So, there _should_ not be any
issues. However, I know that Bio::PrimarySeq is a central module of
Bioperl, so please, have a look at it and let me know if there are any
glaring errors.
Thanks,
Florent
More information about the Bioperl-l
mailing list