[Bioperl-l] Bio::PrimarySeq speedup

Thu Nov 15 16:29:30 UTC 2012

I now merged the branch with master.
Best,
Florent

On 06/11/12 21:06, Florent Angly wrote:
> Yes, good idea, Chris.
>
> Actually, thinking about it, most of these warnings were redundant. 
> So, I changed the behaviour of Bio::PrimarySeq::validate_seq() so that 
> it issues exceptions if requested.
>
> Florent
>
>
> On 05/11/12 12:43, Fields, Christopher J wrote:
>> Florent,
>>
>> Ran tests on it, they pass but I am seeing this (if these are 
>> expected, you can catch the warnings using Test::Warn):
>>
>> [cjfields at pyrimidine-laptop bioperl-live (seqlength)]$ prove -lr 
>> t/Seq/PrimarySeq.t
>> t/Seq/PrimarySeq.t .. 1/167
>> --------------------- WARNING ---------------------
>> MSG: Got a sequence without letters. Could not guess alphabet
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is !
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is $
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is &
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is 
>> \,$,+
>> ---------------------------------------------------
>>
>> --------------------- WARNING ---------------------
>> MSG: sequence '[unidentified sequence]' doesn't validate, mismatch is @/
>> ---------------------------------------------------
>> t/Seq/PrimarySeq.t .. ok
>> All tests successful.
>> Files=1, Tests=167,  0 wallclock secs ( 0.03 usr  0.01 sys + 0.18 
>> cusr  0.01 csys =  0.23 CPU)
>> Result: PASS
>>
>> chris
>>
>> On Nov 4, 2012, at 6:46 PM, Florent Angly <florent.angly at gmail.com>
>>   wrote:
>>
>>> I am planning on merging the branch with master this week.
>>> Best,
>>> Florent
>>>
>>>
>>> On 01/11/12 15:49, Florent Angly wrote:
>>>> Hi all,
>>>>
>>>> I was working with Ben Woodcroft on identifying ways to speed up 
>>>> Grinder, which relies heavily on Bioperl. Ben did some profiling 
>>>> with NYTProf and we realized that a lot of computation time was 
>>>> spent in Bio::PrimarySeq, doing calls to subseq() and length(). The 
>>>> sequences we used for the profiling were microbial genomes, i.e. 
>>>> several Mbp long sequences, which is quite long. A lot of the 
>>>> performance cost was associated with passing full genomes between 
>>>> functions. For example, when doing a call to length(), length() 
>>>> requests the full sequence from seq(), which returns it back to 
>>>> length() (it makes a copy!). So, every call to length is very 
>>>> expensive for long sequences. And there is a lot of code that calls 
>>>> length(), for error checking.
>>>>
>>>> I know that there are a few Bioperl modules that are more adapted 
>>>> to handling very long sequences, e.g. Bio::DB::Fasta or 
>>>> Bio::Seq::LargePrimarySeq. Nevertheless, I decided to have a look 
>>>> at Bio::PrimarySeq with Ben and I released this commit: 
>>>> https://github.com/bioperl/bioperl-live/commit/7436a1b2e2cf9f0ab75a9cd2d78787c7015ef9e5. 
>>>> But in fact, there were more things that I wanted to try to 
>>>> improve, which led me to start this new branch: 
>>>> https://github.com/bioperl/bioperl-live/tree/seqlength
>>>>
>>>> I wrote quite a few tests for functionalities that were not 
>>>> previously covered by tests, and tried to improve the 
>>>> documentation. In addition, to address the speed issue, I did some 
>>>> changes to Bio::PrimarySeq and Bio::PrimarySeqI :
>>>> • The length of a sequence is now computed as soon as the sequence 
>>>> is set, not after. This way, there is no extra call to seq() (which 
>>>> would incur the cost of copying the entire sequence between 
>>>> functions).
>>>> • The length is saved as an object attribute. So, calling length() 
>>>> is very cheap since it only needs to retrieve the stored value for 
>>>> the length.
>>>> • There is a constructor called -direct, which skips sequence 
>>>> validation. However, it was only active in conjunction with the 
>>>> -ref_to_seq constructor. To make -direct conform better to its 
>>>> documented purpose, I made it -direct work when a sequence is set 
>>>> through -seq as well.
>>>> • This brings us to trunc(), revcom() and other methods of 
>>>> Bio::PrimarySeqI. Since all these methods create a new 
>>>> Bio::PrimarySeq object from an existing (already validated!) 
>>>> Bio::PrimarySeq object, the new object can be constructed with the 
>>>> -direct constructor, to save some time.
>>>> • Finally, I noticed that subseq() used calls to eval() to do its 
>>>> work. eval() is notoriously slow and these calls were easily 
>>>> replaced by simple calls to substr() to save some time.
>>>>
>>>> A real-world test I performed with Grinder took 3m28s before the 
>>>> changes (and ~1 min is spent doing something unrelated). After the 
>>>> changes, the same test took only 2min28s. So, it's quite a 
>>>> significant improvement and on more specific test cases, 
>>>> performance gains can obviously be much bigger. Also, I anticipate 
>>>> that the gains would be bigger for even longer sequences.
>>>>
>>>> All the changes I made are meant to be backward compatible and all 
>>>> the tests in the Bioperl test suite passed. So, there _should_ not 
>>>> be any issues. However, I know that Bio::PrimarySeq is a central 
>>>> module of Bioperl, so please, have a look at it and let me know if 
>>>> there are any glaring errors.
>>>>
>>>> Thanks,
>>>>
>>>> Florent
>>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>