[Bioperl-l] SeqIO profiling

Hilmar Lapp hlapp@gnf.org
Thu, 19 Sep 2002 11:23:52 -0700


Another time-waster may be that the fasta-parser actually 
instantiates Bio::Seq objects. This one would create an empty 
Bio::Annotation::Collection. Also, Bio::Seq internally has a 
Bio::PrimarySeq object that it needs to create.

If you're not going to annotate the sequences you should be better 
off by passing the $seqio a SeqFactory that will create 
Bio::PrimarySeq objects:

     my $fact = Bio::Seq::SeqFactory->new(-type => "Bio::PrimarySeq");
     $seqio->sequence_factory($fact);
     # now loop over $seqio->next_seq()

Alternatively, one could smart-ify the respective accessors in 
Bio::Seq to instantiate the objects only once they're requested and 
not yet set.

What I listed only explains a factor of 3x, not 6x though ...

-hilmar

On Tuesday, September 17, 2002, at 05:44 AM, Aaron J Mackey wrote:

>
> With my just-committed patches to Root and RootI (BEGIN->INIT and some
> small optimizations to _rearrange), things are marginally better
> (previously this took around 30 seconds, and _rearrange took more 
> than 7
> seconds of the time):
>
> Total Elapsed Time = 23.99110 Seconds
>          User Time =    26.62 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>  16.0   4.270  4.270 119910   0.0000 0.0000  
> Bio::Root::RootI::_rearrange
>  10.2   2.740  4.700 119910   0.0000 0.0000  Bio::Root::RootI::new
>  8.08   2.150 10.480  47962   0.0000 0.0002  Bio::PrimarySeq::new
>  7.59   2.020 25.150  23982   0.0001 0.0010  
> Bio::SeqIO::fasta::next_seq
>  6.12   1.630 17.210  23990   0.0001 0.0007  Bio::Seq::BEGIN
>  4.92   1.310  1.310 143892   0.0000 0.0000  Bio::Root::Root::verbose
>  4.43   1.180  3.580  47962   0.0000 0.0001  Bio::PrimarySeq::seq
>  3.08   0.820  0.820  95924   0.0000 0.0000  Bio::PrimarySeq::alphabet
>  2.97   0.790  0.790 119910   0.0000 0.0000  
> Bio::Root::Root::_create_object
>  2.89   0.770  3.630  23981   0.0000 0.0002  
> Bio::Annotation::Collection::new
>  2.89   0.770  1.830  23981   0.0000 0.0001  
> Bio::Annotation::TypeManager::new
>  2.85   0.760  1.490  23981   0.0000 0.0001  
> Bio::PrimarySeq::_guess_alphabet
>  2.48   0.660  3.510  47962   0.0000 0.0001  Bio::SeqIO::alphabet
>  2.37   0.630 18.190  23981   0.0000 0.0008  
> Bio::Seq::SeqFactory::create
>  2.22   0.590  0.750  23982   0.0000 0.0000  Bio::Root::IO::_readline
>
> I still don't understand why we're getting 6 times as many
> Bio::Root::RootI::new() calls (and 2 times as many PrimarySeq::new()
> calls) as we are next_seq() calls ... ?!?!
>
> So that folks can play along at home, I've attached the testing script,
> and you can download my test protein database (~5 Mb, compressed) at:
> http://www.people.virginia.edu/~ajm6q/sp_protein.lseg.gz
>
> % perl -d:DProf testSeqIO.pl sp_protein.lseg
> % dprofpp -u
>
> [ ... etc ... ]
>
> -Aaron
>
> On Tue, 17 Sep 2002, Tim Bunce wrote:
>
>> Aaron,
>>
>> Can you send me the script and anything else I'd need to be able
>> to reproduce this data? Inclusing the data file. I'd like to take
>> a look at it (performance tuning is a particular interest of mine).
>>
>> What version of bioperl were you using? 1.0.2, or from the repository?
>>
>> Tim.
>>
>> On Thu, Sep 12, 2002 at 03:40:38PM -0400, Aaron J Mackey wrote:
>>>
>>> here's some more fuel for the fire; this is a profiling run from a
>>> super-simple script that invokes a SeqIO::fasta on a 24K sequence 
>>> protein
>>> database, and does nothing else but call next_seq ... 24K calls to
>>> next_seq, but 120K calls to _rearrange and BEGIN!!
>>>
>>> Total Elapsed Time = 30.84184 Seconds
>>>          User Time = 30.51116 Seconds
>>> Exclusive Times
>>> %Time ExclSec CumulS #Calls sec/call Csec/c  Name
>>>  23.0   7.040  6.980 119910   0.0001 0.0001  
>>> Bio::Root::RootI::_rearrange
>>>  11.6   3.539  5.530 119914   0.0000 0.0000  Bio::Root::RootI::BEGIN
>>>  9.96   3.039 30.811  23982   0.0001 0.0013  
>>> Bio::SeqIO::fasta::next_seq
>>>  8.81   2.689 12.976  47969   0.0001 0.0003  Bio::PrimarySeq::BEGIN
>>>  5.67   1.730  1.658 143892   0.0000 0.0000  Bio::Root::Root::verbose
>>>  5.21   1.589 20.041  23981   0.0001 0.0008  Bio::Seq::new
>>>  3.77   1.150  3.434  47962   0.0000 0.0001  Bio::PrimarySeq::seq
>>>  3.02   0.920  5.134  47962   0.0000 0.0001  Bio::SeqIO::alphabet
>>>  2.92   0.890  2.454  23981   0.0000 0.0001  
>>> Bio::Annotation::TypeManager::new
>>>  2.88   0.880  4.514  23981   0.0000 0.0002  
>>> Bio::Annotation::Collection::new
>>>  2.75   0.840  0.792  95924   0.0000 0.0000  
>>> Bio::PrimarySeq::alphabet
>>>  2.72   0.830  0.850 119917   0.0000 0.0000  Bio::Root::Root::BEGIN
>>>  2.65   0.810 21.299  23981   0.0000 0.0009  
>>> Bio::Seq::SeqFactory::create
>>>  2.52   0.770  0.734  71943   0.0000 0.0000  
>>> Bio::Annotation::TypeManager::_add
>>>                                              _type_map
>>>  2.43   0.740  1.336  23981   0.0000 0.0001  
>>> Bio::PrimarySeq::_guess_alphabet
>>>
>>> here's what the call trace looks like, for each $io->next_seq call:
>>>
>>> Bio::SeqIO::fasta::next_seq
>>>    Bio::Root::IO::_readline
>>>       File::Temp::BEGIN
>>>    Bio::SeqIO::alphabet
>>>    Bio::SeqIO::sequence_factory
>>>    Bio::Seq::SeqFactory::create
>>>       Bio::Seq::SeqFactory::type
>>>       Bio::Root::Root::verbose
>>>       Bio::Seq::new
>>>          Bio::Root::RootI::new
>>>             Bio::Root::Root::BEGIN
>>>             Bio::Root::Root::verbose
>>>          Bio::PrimarySeq::BEGIN
>>>             Bio::Root::RootI::new
>>>                Bio::Root::Root::BEGIN
>>>                Bio::Root::Root::verbose
>>>             Bio::Root::RootI::_rearrange
>>>             Bio::PrimarySeq::seq
>>>                Bio::Root::RootI::_rearrange
>>>                Bio::PrimarySeq::validate_seq
>>>                Bio::PrimarySeq::alphabet
>>>                Bio::PrimarySeq::_guess_alphabet
>>>                   Bio::PrimarySeq::seq
>>>                      Bio::Root::RootI::_rearrange
>>>                   Bio::PrimarySeq::alphabet
>>>             Bio::PrimarySeq::display_id
>>>             Bio::PrimarySeq::primary_id
>>>             Bio::PrimarySeq::desc
>>>          Bio::Root::RootI::_rearrange
>>>          Bio::Seq::primary_id
>>>          Bio::Annotation::Collection::new
>>>             Bio::Root::RootI::new
>>>                Bio::Root::Root::BEGIN
>>>                Bio::Root::Root::verbose
>>>             Bio::Annotation::TypeManager::new
>>>                Bio::Root::RootI::new
>>>                   Bio::Root::Root::BEGIN
>>>                   Bio::Root::Root::verbose
>>>                Bio::Annotation::TypeManager::_add_type_map
>>>                Bio::Annotation::TypeManager::_add_type_map
>>>                Bio::Annotation::TypeManager::_add_type_map
>>>             Bio::Annotation::Collection::_typemap
>>>          Bio::Seq::annotation
>>>          Bio::Seq::primary_seq
>>>             UNIVERSAL::isa
>>>             Bio::Seq::top_SeqFeatures
>>>    Bio::Seq::alphabet
>>>       Bio::Seq::primary_seq
>>>       Bio::PrimarySeq::alphabet
>>>    Bio::SeqIO::alphabet
>>>       Bio::PrimarySeq::BEGIN
>>>          Bio::Root::RootI::new
>>>             Bio::Root::Root::BEGIN
>>>             Bio::Root::Root::verbose
>>>          Bio::Root::RootI::_rearrange
>>>          Bio::PrimarySeq::alphabet
>>>       Bio::Root::RootI::DESTROY
>>>          Bio::Root::Root::_cleanup_methods
>>> Bio::Seq::DESTROY
>>> Bio::Root::RootI::DESTROY
>>>    Bio::Root::Root::_cleanup_methods
>>> Bio::Root::RootI::DESTROY
>>>    Bio::Root::Root::_cleanup_methods
>>> Bio::Root::RootI::DESTROY
>>>    Bio::Root::Root::_cleanup_methods
>>>
>>>
>>> There's a fair amount of handling in Bio::Root::Root BEGIN that
>>> involves setting up Error usage; a $DONT_USE_ERROR in my test script
>>> improves things a little (down to 25 seconds instead of 30), but 
>>> doesn't
>>> change much of the profile percentages.
>>>
>>> -Aaron
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>
> --
>  Aaron J Mackey
>  Pearson Laboratory
>  University of Virginia
>  (434) 924-2821
>  amackey@virginia.edu
>
>
--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------