[Bioperl-l] SeqIO profiling
Hilmar Lapp
hlapp@gnf.org
Thu, 19 Sep 2002 11:23:52 -0700
Another time-waster may be that the fasta-parser actually
instantiates Bio::Seq objects. This one would create an empty
Bio::Annotation::Collection. Also, Bio::Seq internally has a
Bio::PrimarySeq object that it needs to create.
If you're not going to annotate the sequences you should be better
off by passing the $seqio a SeqFactory that will create
Bio::PrimarySeq objects:
my $fact = Bio::Seq::SeqFactory->new(-type => "Bio::PrimarySeq");
$seqio->sequence_factory($fact);
# now loop over $seqio->next_seq()
Alternatively, one could smart-ify the respective accessors in
Bio::Seq to instantiate the objects only once they're requested and
not yet set.
What I listed only explains a factor of 3x, not 6x though ...
-hilmar
On Tuesday, September 17, 2002, at 05:44 AM, Aaron J Mackey wrote:
>
> With my just-committed patches to Root and RootI (BEGIN->INIT and some
> small optimizations to _rearrange), things are marginally better
> (previously this took around 30 seconds, and _rearrange took more
> than 7
> seconds of the time):
>
> Total Elapsed Time = 23.99110 Seconds
> User Time = 26.62 Seconds
> Exclusive Times
> %Time ExclSec CumulS #Calls sec/call Csec/c Name
> 16.0 4.270 4.270 119910 0.0000 0.0000
> Bio::Root::RootI::_rearrange
> 10.2 2.740 4.700 119910 0.0000 0.0000 Bio::Root::RootI::new
> 8.08 2.150 10.480 47962 0.0000 0.0002 Bio::PrimarySeq::new
> 7.59 2.020 25.150 23982 0.0001 0.0010
> Bio::SeqIO::fasta::next_seq
> 6.12 1.630 17.210 23990 0.0001 0.0007 Bio::Seq::BEGIN
> 4.92 1.310 1.310 143892 0.0000 0.0000 Bio::Root::Root::verbose
> 4.43 1.180 3.580 47962 0.0000 0.0001 Bio::PrimarySeq::seq
> 3.08 0.820 0.820 95924 0.0000 0.0000 Bio::PrimarySeq::alphabet
> 2.97 0.790 0.790 119910 0.0000 0.0000
> Bio::Root::Root::_create_object
> 2.89 0.770 3.630 23981 0.0000 0.0002
> Bio::Annotation::Collection::new
> 2.89 0.770 1.830 23981 0.0000 0.0001
> Bio::Annotation::TypeManager::new
> 2.85 0.760 1.490 23981 0.0000 0.0001
> Bio::PrimarySeq::_guess_alphabet
> 2.48 0.660 3.510 47962 0.0000 0.0001 Bio::SeqIO::alphabet
> 2.37 0.630 18.190 23981 0.0000 0.0008
> Bio::Seq::SeqFactory::create
> 2.22 0.590 0.750 23982 0.0000 0.0000 Bio::Root::IO::_readline
>
> I still don't understand why we're getting 6 times as many
> Bio::Root::RootI::new() calls (and 2 times as many PrimarySeq::new()
> calls) as we are next_seq() calls ... ?!?!
>
> So that folks can play along at home, I've attached the testing script,
> and you can download my test protein database (~5 Mb, compressed) at:
> http://www.people.virginia.edu/~ajm6q/sp_protein.lseg.gz
>
> % perl -d:DProf testSeqIO.pl sp_protein.lseg
> % dprofpp -u
>
> [ ... etc ... ]
>
> -Aaron
>
> On Tue, 17 Sep 2002, Tim Bunce wrote:
>
>> Aaron,
>>
>> Can you send me the script and anything else I'd need to be able
>> to reproduce this data? Inclusing the data file. I'd like to take
>> a look at it (performance tuning is a particular interest of mine).
>>
>> What version of bioperl were you using? 1.0.2, or from the repository?
>>
>> Tim.
>>
>> On Thu, Sep 12, 2002 at 03:40:38PM -0400, Aaron J Mackey wrote:
>>>
>>> here's some more fuel for the fire; this is a profiling run from a
>>> super-simple script that invokes a SeqIO::fasta on a 24K sequence
>>> protein
>>> database, and does nothing else but call next_seq ... 24K calls to
>>> next_seq, but 120K calls to _rearrange and BEGIN!!
>>>
>>> Total Elapsed Time = 30.84184 Seconds
>>> User Time = 30.51116 Seconds
>>> Exclusive Times
>>> %Time ExclSec CumulS #Calls sec/call Csec/c Name
>>> 23.0 7.040 6.980 119910 0.0001 0.0001
>>> Bio::Root::RootI::_rearrange
>>> 11.6 3.539 5.530 119914 0.0000 0.0000 Bio::Root::RootI::BEGIN
>>> 9.96 3.039 30.811 23982 0.0001 0.0013
>>> Bio::SeqIO::fasta::next_seq
>>> 8.81 2.689 12.976 47969 0.0001 0.0003 Bio::PrimarySeq::BEGIN
>>> 5.67 1.730 1.658 143892 0.0000 0.0000 Bio::Root::Root::verbose
>>> 5.21 1.589 20.041 23981 0.0001 0.0008 Bio::Seq::new
>>> 3.77 1.150 3.434 47962 0.0000 0.0001 Bio::PrimarySeq::seq
>>> 3.02 0.920 5.134 47962 0.0000 0.0001 Bio::SeqIO::alphabet
>>> 2.92 0.890 2.454 23981 0.0000 0.0001
>>> Bio::Annotation::TypeManager::new
>>> 2.88 0.880 4.514 23981 0.0000 0.0002
>>> Bio::Annotation::Collection::new
>>> 2.75 0.840 0.792 95924 0.0000 0.0000
>>> Bio::PrimarySeq::alphabet
>>> 2.72 0.830 0.850 119917 0.0000 0.0000 Bio::Root::Root::BEGIN
>>> 2.65 0.810 21.299 23981 0.0000 0.0009
>>> Bio::Seq::SeqFactory::create
>>> 2.52 0.770 0.734 71943 0.0000 0.0000
>>> Bio::Annotation::TypeManager::_add
>>> _type_map
>>> 2.43 0.740 1.336 23981 0.0000 0.0001
>>> Bio::PrimarySeq::_guess_alphabet
>>>
>>> here's what the call trace looks like, for each $io->next_seq call:
>>>
>>> Bio::SeqIO::fasta::next_seq
>>> Bio::Root::IO::_readline
>>> File::Temp::BEGIN
>>> Bio::SeqIO::alphabet
>>> Bio::SeqIO::sequence_factory
>>> Bio::Seq::SeqFactory::create
>>> Bio::Seq::SeqFactory::type
>>> Bio::Root::Root::verbose
>>> Bio::Seq::new
>>> Bio::Root::RootI::new
>>> Bio::Root::Root::BEGIN
>>> Bio::Root::Root::verbose
>>> Bio::PrimarySeq::BEGIN
>>> Bio::Root::RootI::new
>>> Bio::Root::Root::BEGIN
>>> Bio::Root::Root::verbose
>>> Bio::Root::RootI::_rearrange
>>> Bio::PrimarySeq::seq
>>> Bio::Root::RootI::_rearrange
>>> Bio::PrimarySeq::validate_seq
>>> Bio::PrimarySeq::alphabet
>>> Bio::PrimarySeq::_guess_alphabet
>>> Bio::PrimarySeq::seq
>>> Bio::Root::RootI::_rearrange
>>> Bio::PrimarySeq::alphabet
>>> Bio::PrimarySeq::display_id
>>> Bio::PrimarySeq::primary_id
>>> Bio::PrimarySeq::desc
>>> Bio::Root::RootI::_rearrange
>>> Bio::Seq::primary_id
>>> Bio::Annotation::Collection::new
>>> Bio::Root::RootI::new
>>> Bio::Root::Root::BEGIN
>>> Bio::Root::Root::verbose
>>> Bio::Annotation::TypeManager::new
>>> Bio::Root::RootI::new
>>> Bio::Root::Root::BEGIN
>>> Bio::Root::Root::verbose
>>> Bio::Annotation::TypeManager::_add_type_map
>>> Bio::Annotation::TypeManager::_add_type_map
>>> Bio::Annotation::TypeManager::_add_type_map
>>> Bio::Annotation::Collection::_typemap
>>> Bio::Seq::annotation
>>> Bio::Seq::primary_seq
>>> UNIVERSAL::isa
>>> Bio::Seq::top_SeqFeatures
>>> Bio::Seq::alphabet
>>> Bio::Seq::primary_seq
>>> Bio::PrimarySeq::alphabet
>>> Bio::SeqIO::alphabet
>>> Bio::PrimarySeq::BEGIN
>>> Bio::Root::RootI::new
>>> Bio::Root::Root::BEGIN
>>> Bio::Root::Root::verbose
>>> Bio::Root::RootI::_rearrange
>>> Bio::PrimarySeq::alphabet
>>> Bio::Root::RootI::DESTROY
>>> Bio::Root::Root::_cleanup_methods
>>> Bio::Seq::DESTROY
>>> Bio::Root::RootI::DESTROY
>>> Bio::Root::Root::_cleanup_methods
>>> Bio::Root::RootI::DESTROY
>>> Bio::Root::Root::_cleanup_methods
>>> Bio::Root::RootI::DESTROY
>>> Bio::Root::Root::_cleanup_methods
>>>
>>>
>>> There's a fair amount of handling in Bio::Root::Root BEGIN that
>>> involves setting up Error usage; a $DONT_USE_ERROR in my test script
>>> improves things a little (down to 25 seconds instead of 30), but
>>> doesn't
>>> change much of the profile percentages.
>>>
>>> -Aaron
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>
> --
> Aaron J Mackey
> Pearson Laboratory
> University of Virginia
> (434) 924-2821
> amackey@virginia.edu
>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------