[Bioperl-l] Bio::PrimarySeq speedup
Florent Angly
florent.angly at gmail.com
Thu Nov 15 16:29:30 UTC 2012
I now merged the branch with master.
Best,
Florent
On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant.
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that
> it issues exceptions if requested.
>
> Florent
>
>
> On 05/11/12 12:43, Fields, Christopher J wrote:
>> Florent,
>>
>> Ran tests on it, they pass but I am seeing this (if these are
>> expected, you can catch the warnings using Test::Warn):
>>
>> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr
>> t/Seq/PrimarySeq.t
>> t/Seq/PrimarySeq.t .. 1/167
>> --------------------- WARNING ---------------------
>> MSG: Got a sequence without letters. Could not guess alphabet
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is
>> \,$,+
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
>> ---------------------------------------------------
>> t/Seq/PrimarySeq.t .. ok
>> All tests successful.
>> Files=1, Tests=167, 0 wallclock secs ( 0.03 usr 0.01 sys + 0.18
>> cusr 0.01 csys = 0.23 CPU)
>> Result: PASS
>>
>> chris
>>
>> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>> wrote:
>>
>>> I am planning on merging the branch with master this week.
>>> Best,
>>> Florent
>>>
>>>
>>> On 01/11/12 15:49, Florent Angly wrote:
>>>> Hi all,
>>>>
>>>> I was working with Ben Woodcroft on identifying ways to speed up
>>>> Grinder, which relies heavily on Bioperl. Ben did some profiling
>>>> with NYTProf and we realized that a lot of computation time was
>>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The
>>>> sequences we used for the profiling were microbial genomes, i.e.
>>>> several Mbp long sequences, which is quite long. A lot of the
>>>> performance cost was associated with passing full genomes between
>>>> functions. For example, when doing a call to length(), length()
>>>> requests the full sequence from seq(), which returns it back to
>>>> length() (it makes a copy!). So, every call to length is very
>>>> expensive for long sequences. And there is a lot of code that calls
>>>> length(), for error checking.
>>>>
>>>> I know that there are a few Bioperl modules that are more adapted
>>>> to handling very long sequences, e.g. Bio::DB::Fasta or
>>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look
>>>> at Bio::PrimarySeq with Ben and I released this commit:
>>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5.
>>>> But in fact, there were more things that I wanted to try to
>>>> improve, which led me to start this new branch:
>>>> https://github.com/bioperl/bioperl-live/tree/seqlength
>>>>
>>>> I wrote quite a few tests for functionalities that were not
>>>> previously covered by tests, and tried to improve the
>>>> documentation. In addition, to address the speed issue, I did some
>>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>>> • The length of a sequence is now computed as soon as the sequence
>>>> is set, not after. This way, there is no extra call to seq() (which
>>>> would incur the cost of copying the entire sequence between
>>>> functions).
>>>> • The length is saved as an object attribute. So, calling length()
>>>> is very cheap since it only needs to retrieve the stored value for
>>>> the length.
>>>> • There is a constructor called -direct, which skips sequence
>>>> validation. However, it was only active in conjunction with the
>>>> -ref_to_seq constructor. To make -direct conform better to its
>>>> documented purpose, I made it -direct work when a sequence is set
>>>> through -seq as well.
>>>> • This brings us to trunc(), revcom() and other methods of
>>>> Bio::PrimarySeqI. Since all these methods create a new
>>>> Bio::PrimarySeq object from an existing (already validated!)
>>>> Bio::PrimarySeq object, the new object can be constructed with the
>>>> -direct constructor, to save some time.
>>>> • Finally, I noticed that subseq() used calls to eval() to do its
>>>> work. eval() is notoriously slow and these calls were easily
>>>> replaced by simple calls to substr() to save some time.
>>>>
>>>> A real-world test I performed with Grinder took 3m28s before the
>>>> changes (and ~1 min is spent doing something unrelated). After the
>>>> changes, the same test took only 2min28s. So, it's quite a
>>>> significant improvement and on more specific test cases,
>>>> performance gains can obviously be much bigger. Also, I anticipate
>>>> that the gains would be bigger for even longer sequences.
>>>>
>>>> All the changes I made are meant to be backward compatible and all
>>>> the tests in the Bioperl test suite passed. So, there _should_ not
>>>> be any issues. However, I know that Bio::PrimarySeq is a central
>>>> module of Bioperl, so please, have a look at it and let me know if
>>>> there are any glaring errors.
>>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
More information about the Bioperl-l
mailing list