[Bioperl-l] SeqIO profiling
Steve Chervitz
sac@bioperl.org
Thu, 19 Sep 2002 10:48:10 -0700 (PDT)
--- Ewan Birney <birney@ebi.ac.uk> wrote:
> On Tue, 17 Sep 2002, Aaron J Mackey wrote:
>
> >
> > With my just-committed patches to Root and RootI (BEGIN->INIT and some
> > small optimizations to _rearrange), things are marginally better
> > (previously this took around 30 seconds, and _rearrange took more than 7
> > seconds of the time):
> >
> > Total Elapsed Time = 23.99110 Seconds
> > User Time = 26.62 Seconds
> > Exclusive Times
> > %Time ExclSec CumulS #Calls sec/call Csec/c Name
> > 16.0 4.270 4.270 119910 0.0000 0.0000 Bio::Root::RootI::_rearrange
> > 10.2 2.740 4.700 119910 0.0000 0.0000 Bio::Root::RootI::new
> > 8.08 2.150 10.480 47962 0.0000 0.0002 Bio::PrimarySeq::new
> > 7.59 2.020 25.150 23982 0.0001 0.0010 Bio::SeqIO::fasta::next_seq
> > 6.12 1.630 17.210 23990 0.0001 0.0007 Bio::Seq::BEGIN
> > 4.92 1.310 1.310 143892 0.0000 0.0000 Bio::Root::Root::verbose
> > 4.43 1.180 3.580 47962 0.0000 0.0001 Bio::PrimarySeq::seq
> > 3.08 0.820 0.820 95924 0.0000 0.0000 Bio::PrimarySeq::alphabet
> > 2.97 0.790 0.790 119910 0.0000 0.0000
> Bio::Root::Root::_create_object
> > 2.89 0.770 3.630 23981 0.0000 0.0002
> Bio::Annotation::Collection::new
> > 2.89 0.770 1.830 23981 0.0000 0.0001
> Bio::Annotation::TypeManager::new
> > 2.85 0.760 1.490 23981 0.0000 0.0001
> Bio::PrimarySeq::_guess_alphabet
> > 2.48 0.660 3.510 47962 0.0000 0.0001 Bio::SeqIO::alphabet
> > 2.37 0.630 18.190 23981 0.0000 0.0008 Bio::Seq::SeqFactory::create
> > 2.22 0.590 0.750 23982 0.0000 0.0000 Bio::Root::IO::_readline
> >
> > I still don't understand why we're getting 6 times as many
> > Bio::Root::RootI::new() calls (and 2 times as many PrimarySeq::new()
> > calls) as we are next_seq() calls ... ?!?!
> >
>
> Very weird. Of course we build a Bio::Seq and that makes an
> annotatiocollection I notice. Perhaps we should build an annotation
> collection on demand in here
>
>
> The clearest optimisation is to have some "special" constructors which
> SeqIO knows about which by passes the _rerrange stuff.
>
There's also a boat load of verbose() calls that can probably be reduced. We
can get rid of most of these by setting verbosity directly rather than calling
$self->verbose() within Bio::Root::RootI::new(). This seems reasonable since
we're just setting it here. (Then we should then move the verbose()
implementation into RootI.pm as well, though it would be best to have all impls
in Root.pm).
Another suggestion is to not call _create_object(), but to do the bless{}-ing
directly in new() and do away with _create_object(). There aren't any modules
that are using it within Bioperl, and it's a fairly recent addition, so it's
probably safe to deprecate-then-remove it.
I'm still perplexed about all of the repeated BEGIN calls that Aaron found (see
below). From what I have read and tested, BEGIN should be called once during
compilation time. Having duplicate use/require statements across many modules
shouldn't matter since require is smart enough to only compile a module once.
Here's a little test I did that seems to bear this out:
package BeginTest;
BEGIN {
print "Called BEGIN\n";
}
sub new {
my ($class, %args) = @_;
$class = ref($class) if ref($class);
return bless { %args }, $class;
}
package main;
# Note: I get the same result whether or not BeginTest is
# defined in a separate module and I call 'use BeginTest' in the script,
# or if it's defined in the same file as the script.
for (1..5) {
require BeginTest;
my $obj = BeginTest->new( -name=> $_, -foo => "foo$_" );
print "Object $obj->{-name}: foo=$obj->{-foo}\n";
}
Here's the output:
Called BEGIN
Object 1: foo=foo1
Object 2: foo=foo2
Object 3: foo=foo3
Object 4: foo=foo4
Object 5: foo=foo5
Maybe there was something about the profiling environment that lead to all of
the BEGIN calls that Aaron found. Aaron, can you run your profiler on the above
simple script and see if you get multiple BEGIN hits?
Steve
>
>
> > So that folks can play along at home, I've attached the testing script,
> > and you can download my test protein database (~5 Mb, compressed) at:
> > http://www.people.virginia.edu/~ajm6q/sp_protein.lseg.gz
> >
> > % perl -d:DProf testSeqIO.pl sp_protein.lseg
> > % dprofpp -u
> >
> > [ ... etc ... ]
> >
> > -Aaron
> >
> > On Tue, 17 Sep 2002, Tim Bunce wrote:
> >
> > > Aaron,
> > >
> > > Can you send me the script and anything else I'd need to be able
> > > to reproduce this data? Inclusing the data file. I'd like to take
> > > a look at it (performance tuning is a particular interest of mine).
> > >
> > > What version of bioperl were you using? 1.0.2, or from the repository?
> > >
> > > Tim.
> > >
> > > On Thu, Sep 12, 2002 at 03:40:38PM -0400, Aaron J Mackey wrote:
> > > >
> > > > here's some more fuel for the fire; this is a profiling run from a
> > > > super-simple script that invokes a SeqIO::fasta on a 24K sequence
> protein
> > > > database, and does nothing else but call next_seq ... 24K calls to
> > > > next_seq, but 120K calls to _rearrange and BEGIN!!
> > > >
> > > > Total Elapsed Time = 30.84184 Seconds
> > > > User Time = 30.51116 Seconds
> > > > Exclusive Times
> > > > %Time ExclSec CumulS #Calls sec/call Csec/c Name
> > > > 23.0 7.040 6.980 119910 0.0001 0.0001
> Bio::Root::RootI::_rearrange
> > > > 11.6 3.539 5.530 119914 0.0000 0.0000 Bio::Root::RootI::BEGIN
> > > > 9.96 3.039 30.811 23982 0.0001 0.0013
> Bio::SeqIO::fasta::next_seq
> > > > 8.81 2.689 12.976 47969 0.0001 0.0003 Bio::PrimarySeq::BEGIN
> > > > 5.67 1.730 1.658 143892 0.0000 0.0000 Bio::Root::Root::verbose
> > > > 5.21 1.589 20.041 23981 0.0001 0.0008 Bio::Seq::new
> > > > 3.77 1.150 3.434 47962 0.0000 0.0001 Bio::PrimarySeq::seq
> > > > 3.02 0.920 5.134 47962 0.0000 0.0001 Bio::SeqIO::alphabet
> > > > 2.92 0.890 2.454 23981 0.0000 0.0001
> Bio::Annotation::TypeManager::new
> > > > 2.88 0.880 4.514 23981 0.0000 0.0002
> Bio::Annotation::Collection::new
> > > > 2.75 0.840 0.792 95924 0.0000 0.0000 Bio::PrimarySeq::alphabet
> > > > 2.72 0.830 0.850 119917 0.0000 0.0000 Bio::Root::Root::BEGIN
> > > > 2.65 0.810 21.299 23981 0.0000 0.0009
> Bio::Seq::SeqFactory::create
> > > > 2.52 0.770 0.734 71943 0.0000 0.0000
> Bio::Annotation::TypeManager::_add
> > > > _type_map
> > > > 2.43 0.740 1.336 23981 0.0000 0.0001
> Bio::PrimarySeq::_guess_alphabet
> > > >
> > > > here's what the call trace looks like, for each $io->next_seq call:
> > > >
> > > > Bio::SeqIO::fasta::next_seq
> > > > Bio::Root::IO::_readline
> > > > File::Temp::BEGIN
> > > > Bio::SeqIO::alphabet
> > > > Bio::SeqIO::sequence_factory
> > > > Bio::Seq::SeqFactory::create
> > > > Bio::Seq::SeqFactory::type
> > > > Bio::Root::Root::verbose
> > > > Bio::Seq::new
> > > > Bio::Root::RootI::new
> > > > Bio::Root::Root::BEGIN
> > > > Bio::Root::Root::verbose
> > > > Bio::PrimarySeq::BEGIN
> > > > Bio::Root::RootI::new
> > > > Bio::Root::Root::BEGIN
> > > > Bio::Root::Root::verbose
> > > > Bio::Root::RootI::_rearrange
> > > > Bio::PrimarySeq::seq
> > > > Bio::Root::RootI::_rearrange
> > > > Bio::PrimarySeq::validate_seq
> > > > Bio::PrimarySeq::alphabet
> > > > Bio::PrimarySeq::_guess_alphabet
> > > > Bio::PrimarySeq::seq
> > > > Bio::Root::RootI::_rearrange
> > > > Bio::PrimarySeq::alphabet
> > > > Bio::PrimarySeq::display_id
> > > > Bio::PrimarySeq::primary_id
> > > > Bio::PrimarySeq::desc
> > > > Bio::Root::RootI::_rearrange
> > > > Bio::Seq::primary_id
> > > > Bio::Annotation::Collection::new
> > > > Bio::Root::RootI::new
> > > > Bio::Root::Root::BEGIN
> > > > Bio::Root::Root::verbose
> > > > Bio::Annotation::TypeManager::new
> > > > Bio::Root::RootI::new
> > > > Bio::Root::Root::BEGIN
> > > > Bio::Root::Root::verbose
> > > > Bio::Annotation::TypeManager::_add_type_map
> > > > Bio::Annotation::TypeManager::_add_type_map
> > > > Bio::Annotation::TypeManager::_add_type_map
> > > > Bio::Annotation::Collection::_typemap
> > > > Bio::Seq::annotation
> > > > Bio::Seq::primary_seq
> > > > UNIVERSAL::isa
> > > > Bio::Seq::top_SeqFeatures
> > > > Bio::Seq::alphabet
> > > > Bio::Seq::primary_seq
> > > > Bio::PrimarySeq::alphabet
> > > > Bio::SeqIO::alphabet
> > > > Bio::PrimarySeq::BEGIN
> > > > Bio::Root::RootI::new
> > > > Bio::Root::Root::BEGIN
> > > > Bio::Root::Root::verbose
> > > > Bio::Root::RootI::_rearrange
> > > > Bio::PrimarySeq::alphabet
> > > > Bio::Root::RootI::DESTROY
> > > > Bio::Root::Root::_cleanup_methods
> > > > Bio::Seq::DESTROY
> > > > Bio::Root::RootI::DESTROY
> > > > Bio::Root::Root::_cleanup_methods
> > > > Bio::Root::RootI::DESTROY
> > > > Bio::Root::Root::_cleanup_methods
> > > > Bio::Root::RootI::DESTROY
> > > > Bio::Root::Root::_cleanup_methods
> > > >
> > > >
> > > > There's a fair amount of handling in Bio::Root::Root BEGIN that
> > > > involves setting up Error usage; a $DONT_USE_ERROR in my test script
> > > > improves things a little (down to 25 seconds instead of 30), but
> doesn't
> > > > change much of the profile percentages.
> > > >
> > > > -Aaron
> > > >
> > > >
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l@bioperl.org
> > > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> >
> > --
> > Aaron J Mackey
> > Pearson Laboratory
> > University of Virginia
> > (434) 924-2821
> > amackey@virginia.edu
> >
> >
>
> -----------------------------------------------------------------
> Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
> <birney@ebi.ac.uk>.
> -----------------------------------------------------------------
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
=====
Steve Chervitz
sac@bioperl.org
__________________________________________________
Do you Yahoo!?
New DSL Internet Access from SBC & Yahoo!
http://sbc.yahoo.com