[Bioperl-l] SeqIO profiling

Steve Chervitz sac@bioperl.org
Thu, 19 Sep 2002 10:48:10 -0700 (PDT)


--- Ewan Birney <birney@ebi.ac.uk> wrote:
> On Tue, 17 Sep 2002, Aaron J Mackey wrote:
> 
> > 
> > With my just-committed patches to Root and RootI (BEGIN->INIT and some
> > small optimizations to _rearrange), things are marginally better
> > (previously this took around 30 seconds, and _rearrange took more than 7
> > seconds of the time):
> > 
> > Total Elapsed Time = 23.99110 Seconds
> >          User Time =    26.62 Seconds
> > Exclusive Times
> > %Time ExclSec CumulS #Calls sec/call Csec/c  Name
> >  16.0   4.270  4.270 119910   0.0000 0.0000  Bio::Root::RootI::_rearrange
> >  10.2   2.740  4.700 119910   0.0000 0.0000  Bio::Root::RootI::new
> >  8.08   2.150 10.480  47962   0.0000 0.0002  Bio::PrimarySeq::new
> >  7.59   2.020 25.150  23982   0.0001 0.0010  Bio::SeqIO::fasta::next_seq
> >  6.12   1.630 17.210  23990   0.0001 0.0007  Bio::Seq::BEGIN
> >  4.92   1.310  1.310 143892   0.0000 0.0000  Bio::Root::Root::verbose
> >  4.43   1.180  3.580  47962   0.0000 0.0001  Bio::PrimarySeq::seq
> >  3.08   0.820  0.820  95924   0.0000 0.0000  Bio::PrimarySeq::alphabet
> >  2.97   0.790  0.790 119910   0.0000 0.0000 
> Bio::Root::Root::_create_object
> >  2.89   0.770  3.630  23981   0.0000 0.0002 
> Bio::Annotation::Collection::new
> >  2.89   0.770  1.830  23981   0.0000 0.0001 
> Bio::Annotation::TypeManager::new
> >  2.85   0.760  1.490  23981   0.0000 0.0001 
> Bio::PrimarySeq::_guess_alphabet
> >  2.48   0.660  3.510  47962   0.0000 0.0001  Bio::SeqIO::alphabet
> >  2.37   0.630 18.190  23981   0.0000 0.0008  Bio::Seq::SeqFactory::create
> >  2.22   0.590  0.750  23982   0.0000 0.0000  Bio::Root::IO::_readline
> > 
> > I still don't understand why we're getting 6 times as many
> > Bio::Root::RootI::new() calls (and 2 times as many PrimarySeq::new()
> > calls) as we are next_seq() calls ... ?!?!
> > 
> 
> Very weird. Of course we build a Bio::Seq and that makes an 
> annotatiocollection I notice. Perhaps we should build an annotation 
> collection on demand in here 
> 
> 
> The clearest optimisation is to have some "special" constructors which 
> SeqIO knows about which by passes the _rerrange stuff. 
> 

There's also a boat load of verbose() calls that can probably be reduced. We
can get rid of most of these by setting verbosity directly rather than calling
$self->verbose() within Bio::Root::RootI::new(). This seems reasonable since
we're just setting it here. (Then we should then move the verbose()
implementation into RootI.pm as well, though it would be best to have all impls
in Root.pm).

Another suggestion is to not call _create_object(), but to do the bless{}-ing
directly in new() and do away with _create_object(). There aren't any modules
that are using it within Bioperl, and it's a fairly recent addition, so it's
probably safe to deprecate-then-remove it.

I'm still perplexed about all of the repeated BEGIN calls that Aaron found (see
below). From what I have read and tested, BEGIN should be called once during
compilation time. Having duplicate use/require statements across many modules
shouldn't matter since require is smart enough to only compile a module once.

Here's a little test I did that seems to bear this out:

package BeginTest;

BEGIN {
    print "Called BEGIN\n";
}

sub new {
  my ($class, %args) = @_;
  $class = ref($class) if ref($class);
  return bless { %args }, $class;
}

package main;

# Note: I get the same result whether or not BeginTest is
# defined in a separate module and I call 'use BeginTest' in the script,
# or if it's defined in the same file as the script.

for (1..5) {
    require BeginTest;
    my $obj = BeginTest->new( -name=> $_, -foo => "foo$_" );
    print "Object $obj->{-name}: foo=$obj->{-foo}\n";
}

Here's the output:

Called BEGIN
Object 1: foo=foo1
Object 2: foo=foo2
Object 3: foo=foo3
Object 4: foo=foo4
Object 5: foo=foo5

Maybe there was something about the profiling environment that lead to all of
the BEGIN calls that Aaron found. Aaron, can you run your profiler on the above
simple script and see if you get multiple BEGIN hits?

Steve

> 
> 
> > So that folks can play along at home, I've attached the testing script,
> > and you can download my test protein database (~5 Mb, compressed) at:
> > http://www.people.virginia.edu/~ajm6q/sp_protein.lseg.gz
> > 
> > % perl -d:DProf testSeqIO.pl sp_protein.lseg
> > % dprofpp -u
> > 
> > [ ... etc ... ]
> > 
> > -Aaron
> > 
> > On Tue, 17 Sep 2002, Tim Bunce wrote:
> > 
> > > Aaron,
> > >
> > > Can you send me the script and anything else I'd need to be able
> > > to reproduce this data? Inclusing the data file. I'd like to take
> > > a look at it (performance tuning is a particular interest of mine).
> > >
> > > What version of bioperl were you using? 1.0.2, or from the repository?
> > >
> > > Tim.
> > >
> > > On Thu, Sep 12, 2002 at 03:40:38PM -0400, Aaron J Mackey wrote:
> > > >
> > > > here's some more fuel for the fire; this is a profiling run from a
> > > > super-simple script that invokes a SeqIO::fasta on a 24K sequence
> protein
> > > > database, and does nothing else but call next_seq ... 24K calls to
> > > > next_seq, but 120K calls to _rearrange and BEGIN!!
> > > >
> > > > Total Elapsed Time = 30.84184 Seconds
> > > >          User Time = 30.51116 Seconds
> > > > Exclusive Times
> > > > %Time ExclSec CumulS #Calls sec/call Csec/c  Name
> > > >  23.0   7.040  6.980 119910   0.0001 0.0001 
> Bio::Root::RootI::_rearrange
> > > >  11.6   3.539  5.530 119914   0.0000 0.0000  Bio::Root::RootI::BEGIN
> > > >  9.96   3.039 30.811  23982   0.0001 0.0013 
> Bio::SeqIO::fasta::next_seq
> > > >  8.81   2.689 12.976  47969   0.0001 0.0003  Bio::PrimarySeq::BEGIN
> > > >  5.67   1.730  1.658 143892   0.0000 0.0000  Bio::Root::Root::verbose
> > > >  5.21   1.589 20.041  23981   0.0001 0.0008  Bio::Seq::new
> > > >  3.77   1.150  3.434  47962   0.0000 0.0001  Bio::PrimarySeq::seq
> > > >  3.02   0.920  5.134  47962   0.0000 0.0001  Bio::SeqIO::alphabet
> > > >  2.92   0.890  2.454  23981   0.0000 0.0001 
> Bio::Annotation::TypeManager::new
> > > >  2.88   0.880  4.514  23981   0.0000 0.0002 
> Bio::Annotation::Collection::new
> > > >  2.75   0.840  0.792  95924   0.0000 0.0000  Bio::PrimarySeq::alphabet
> > > >  2.72   0.830  0.850 119917   0.0000 0.0000  Bio::Root::Root::BEGIN
> > > >  2.65   0.810 21.299  23981   0.0000 0.0009 
> Bio::Seq::SeqFactory::create
> > > >  2.52   0.770  0.734  71943   0.0000 0.0000 
> Bio::Annotation::TypeManager::_add
> > > >                                              _type_map
> > > >  2.43   0.740  1.336  23981   0.0000 0.0001 
> Bio::PrimarySeq::_guess_alphabet
> > > >
> > > > here's what the call trace looks like, for each $io->next_seq call:
> > > >
> > > > Bio::SeqIO::fasta::next_seq
> > > >    Bio::Root::IO::_readline
> > > >       File::Temp::BEGIN
> > > >    Bio::SeqIO::alphabet
> > > >    Bio::SeqIO::sequence_factory
> > > >    Bio::Seq::SeqFactory::create
> > > >       Bio::Seq::SeqFactory::type
> > > >       Bio::Root::Root::verbose
> > > >       Bio::Seq::new
> > > >          Bio::Root::RootI::new
> > > >             Bio::Root::Root::BEGIN
> > > >             Bio::Root::Root::verbose
> > > >          Bio::PrimarySeq::BEGIN
> > > >             Bio::Root::RootI::new
> > > >                Bio::Root::Root::BEGIN
> > > >                Bio::Root::Root::verbose
> > > >             Bio::Root::RootI::_rearrange
> > > >             Bio::PrimarySeq::seq
> > > >                Bio::Root::RootI::_rearrange
> > > >                Bio::PrimarySeq::validate_seq
> > > >                Bio::PrimarySeq::alphabet
> > > >                Bio::PrimarySeq::_guess_alphabet
> > > >                   Bio::PrimarySeq::seq
> > > >                      Bio::Root::RootI::_rearrange
> > > >                   Bio::PrimarySeq::alphabet
> > > >             Bio::PrimarySeq::display_id
> > > >             Bio::PrimarySeq::primary_id
> > > >             Bio::PrimarySeq::desc
> > > >          Bio::Root::RootI::_rearrange
> > > >          Bio::Seq::primary_id
> > > >          Bio::Annotation::Collection::new
> > > >             Bio::Root::RootI::new
> > > >                Bio::Root::Root::BEGIN
> > > >                Bio::Root::Root::verbose
> > > >             Bio::Annotation::TypeManager::new
> > > >                Bio::Root::RootI::new
> > > >                   Bio::Root::Root::BEGIN
> > > >                   Bio::Root::Root::verbose
> > > >                Bio::Annotation::TypeManager::_add_type_map
> > > >                Bio::Annotation::TypeManager::_add_type_map
> > > >                Bio::Annotation::TypeManager::_add_type_map
> > > >             Bio::Annotation::Collection::_typemap
> > > >          Bio::Seq::annotation
> > > >          Bio::Seq::primary_seq
> > > >             UNIVERSAL::isa
> > > >             Bio::Seq::top_SeqFeatures
> > > >    Bio::Seq::alphabet
> > > >       Bio::Seq::primary_seq
> > > >       Bio::PrimarySeq::alphabet
> > > >    Bio::SeqIO::alphabet
> > > >       Bio::PrimarySeq::BEGIN
> > > >          Bio::Root::RootI::new
> > > >             Bio::Root::Root::BEGIN
> > > >             Bio::Root::Root::verbose
> > > >          Bio::Root::RootI::_rearrange
> > > >          Bio::PrimarySeq::alphabet
> > > >       Bio::Root::RootI::DESTROY
> > > >          Bio::Root::Root::_cleanup_methods
> > > > Bio::Seq::DESTROY
> > > > Bio::Root::RootI::DESTROY
> > > >    Bio::Root::Root::_cleanup_methods
> > > > Bio::Root::RootI::DESTROY
> > > >    Bio::Root::Root::_cleanup_methods
> > > > Bio::Root::RootI::DESTROY
> > > >    Bio::Root::Root::_cleanup_methods
> > > >
> > > >
> > > > There's a fair amount of handling in Bio::Root::Root BEGIN that
> > > > involves setting up Error usage; a $DONT_USE_ERROR in my test script
> > > > improves things a little (down to 25 seconds instead of 30), but
> doesn't
> > > > change much of the profile percentages.
> > > >
> > > > -Aaron
> > > >
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> > 
> > -- 
> >  Aaron J Mackey
> >  Pearson Laboratory
> >  University of Virginia
> >  (434) 924-2821
> >  amackey@virginia.edu
> > 
> > 
> 
> -----------------------------------------------------------------
> Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> <birney@ebi.ac.uk>. 
> -----------------------------------------------------------------
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l


=====
Steve Chervitz
sac@bioperl.org

__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com