[Bioperl-l] Performance of Bio::Species

Tue Nov 21 17:20:34 UTC 2006

The new Bio::Species implementation seems to degrade performance 
significantly. This appears to happen when the Bio::Tree::Tree is constructed.
See the stats below (based on simple Bio::Species object construction; 
script and test sequence file attached):

10000 iterations
new implementation:
Constructor: 115 wallclock secs (113.50 usr +  0.67 sys = 114.17 CPU)
Accessor:  0 wallclock secs ( 0.17 usr +  0.00 sys =  0.17 CPU)

old implementation (bioperl-1.4):
Constructor:  1 wallclock secs ( 0.84 usr +  0.10 sys =  0.94 CPU)
Accessor:  0 wallclock secs ( 0.13 usr +  0.01 sys =  0.14 CPU)

When reading a GenBank file, the time needed to construct the Bio::Seq 
object nearly doubles (100 iterations):
old implementation (bioperl-1.4):
Constructor:  0 wallclock secs ( 0.01 usr +  0.00 sys =  0.01 CPU)
Accessor:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
Constructor(seqio)/reading seq:  3 wallclock secs ( 2.51 usr +  0.31 sys =  2.83 CPU)

new implementation:
Constructor:  2 wallclock secs ( 1.14 usr +  0.01 sys =  1.15 CPU)
Accessor:  0 wallclock secs ( 0.00 usr +  0.00 sys =  0.00 CPU)
Constructor(seqio)/reading seq:  5 wallclock secs ( 5.10 usr +  0.20 sys =  5.30 CPU)

This may not pose a problem to people who read few sequences or files 
with no lineage data, but it could be a significant headache otherwise.
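
To put the figures above side by side, here is a small sketch that only 
divides the quoted CPU times (new over old). The numbers are copied from 
the timings in this message; the hash layout and variable names are just 
for illustration.

```perl
use strict;
use warnings;

# CPU seconds copied from the timings quoted in this message.
my %cpu = (
    constructor => { old => 0.94, new => 114.17 },   # 10000 iterations
    seqio       => { old => 2.83, new => 5.30   },   # 100 iterations
);

# Slowdown factor = new CPU time divided by old CPU time.
my %ratio;
for my $case (sort keys %cpu) {
    $ratio{$case} = $cpu{$case}{new} / $cpu{$case}{old};
    printf "%-12s new/old = %.1fx\n", $case, $ratio{$case};
}
```

That works out to roughly a 120-fold slowdown for the bare constructor 
and a bit under 2-fold for reading a sequence through Bio::SeqIO.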

I see from CVS that Sendu knows about the memory leaks (I find reference 
cycles). If the classification is supplied incorrectly (for example, with 
an array reference as one of the elements of the classification array), 
things get really messy (~17 GB of RAM for a Bio::Species object), though 
oddly enough the cycle is not infinite. If I have more time I will try to 
debug this further and submit a formal bug report/patch, but I am not sure 
I will get to it anytime soon. I am sure there are people who understand 
Bio::Taxon/Bio::Tree::Tree better than I do and might have a better idea 
of how to fix this.
Stefan
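
For readers unfamiliar with why reference cycles leak in Perl, here is a 
generic illustration. It does NOT use Bio::Taxon or Bio::Tree::Tree and 
makes no claim about how they are built internally; the Node class and 
build_pair function are made up for the example. It only shows that a 
parent/child pair holding strong references to each other is never 
freed, while weakening one direction with Scalar::Util's weaken lets 
both go. A tool such as Devel::Cycle (its find_cycle function) is one 
way to look for such cycles in a real object.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

# Toy node class, NOT the real Bio::Taxon: it only counts destructions.
{
    package Node;
    my $destroyed = 0;
    sub new       { my ($class, $name) = @_; return bless { name => $name }, $class }
    sub DESTROY   { $destroyed++ }
    sub destroyed { return $destroyed }
}

# Build a parent and child that point at each other, then let them
# go out of scope. With $weak true, the child-to-parent link is weakened.
sub build_pair {
    my ($weak) = @_;
    my $parent = Node->new('Homo');
    my $child  = Node->new('sapiens');
    $parent->{child} = $child;
    $child->{parent} = $parent;
    weaken($child->{parent}) if $weak;
    return;
}

build_pair(0);                              # strong cycle
my $leaked = Node::destroyed();             # nodes freed so far
build_pair(1);                              # back-reference weakened
my $freed = Node::destroyed() - $leaked;    # nodes freed by the second call

print "freed after strong cycle: $leaked\n";
print "freed after weakened cycle: $freed\n";
```

With the strong cycle neither node is destroyed until the program exits; 
with the weakened back-reference both are destroyed as soon as 
build_pair returns.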

///
use strict;
use warnings;
use Bio::Species;
use Benchmark;
use Bio::SeqIO;

my @classification = qw( sapiens Homo Hominidae
                         Catarrhini Primates Eutheria
                         Mammalia Vertebrata Chordata
                         Metazoa Eukaryota );

# Time Bio::Species construction (100 iterations here; the first set of
# figures quoted above was taken with 10000).
my $species;
my $t1 = Benchmark->new;
for my $i (1..100) {
    $species = Bio::Species->new(-classification => [@classification]);
}

# Time a simple accessor on the last object built.
my $t2 = Benchmark->new;
for my $i (1..100) {
    my $bin = $species->binomial;
}
my $t3 = Benchmark->new;
print "Constructor: ", timestr(timediff($t2, $t1)), "\n";
print "Accessor: ", timestr(timediff($t3, $t2)), "\n";

# Time reading the first sequence of the GenBank file given as argument.
my $f = shift or die "usage: $0 file.genbank\n";
my $t4 = Benchmark->new;
for my $i (1..100) {
    my $sio = Bio::SeqIO->new(-file => $f, -format => 'genbank');
    my $seq = $sio->next_seq;
}
my $t5 = Benchmark->new;

print "Constructor(seqio)/reading seq: ", timestr(timediff($t5, $t4)), "\n";
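
The script above passes a flat list of names. As a guard against the 
mis-supplied classification mentioned earlier (an array reference 
sitting inside the classification array), a caller could check the list 
before handing it to the constructor. This is only a sketch: 
nested_positions is a hypothetical helper, not part of BioPerl, and the 
sample lists are invented.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns the indices of any
# elements that are references rather than plain strings.
sub nested_positions {
    my @classification = @_;
    return grep { ref $classification[$_] } 0 .. $#classification;
}

my @good = qw(sapiens Homo Hominidae);
my @bad  = ('sapiens', ['Homo', 'Hominidae']);   # array ref slipped in

for my $list (\@good, \@bad) {
    my @where = nested_positions(@$list);
    if (@where) {
        print "refusing: reference(s) at index @where\n";
    }
    else {
        print "ok: flat list of ", scalar(@$list), " names\n";
    }
}
```

Only a list that passes the check would then be given to 
Bio::Species->new(-classification => [...]).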

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: NM_000161.genbank
URL: <http://lists.open-bio.org/pipermail/bioperl-l/attachments/20061121/5d9e857e/attachment.ksh>