[Bioperl-l] GenBankParser comparison to bioperl parser
Gudmundur Arni Thorisson
mummi@cshl.org
Thu, 12 Sep 2002 12:44:02 -0400
How would one go about reducing the object creation overhead, if I may
be so bold as to ask?
Mummi
On Wednesday, September 11, 2002, at 04:14 PM, Jason Stajich wrote:
> On Wed, 11 Sep 2002, John Kloss wrote:
>
>> I was asked to provided some coarses tests demonstrating the features
>> and speed of the GenBankParser in comparsion to the parser provided in
>> bioperl. I just
>> learned how to use the bioperl interface, so please let me know if I did
>> anything stupid.
>>
>> In all tests I used the same flat file
>>
>> bct1.seq size gzipped 73679187 bytes.
>>
>> All tests run on a realtively loaded Sun Enterprise 4500 8 x 400Mhz
>> sparc with
>> 10Gbs ram.
>>
>> 1st test. Convert GenBank flat file to fasta.
>>
>> ########################################
>> Bioperl parser code:
>>
>> #!/usr/local/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> $in = Bio::SeqIO->newFh(
>> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
>> '-format' => 'Genbank' );
>>
>> $out = Bio::SeqIO->newFh(
>> '-file' => "> output1.txt",
>> '-format' => 'fasta' );
>>
>> print $out $_ while <$in>;
>>
>> ########################################
>> Time:
>>
>> real 17m4.008s
>> user 15m19.500s
>> sys 0m20.850s
>>
>>
>> ########################################
>> GenBankParser code:
>>
>> #!/usr/local/bin/perl -w
>>
>> use GenBankParser qw( DEFINITION LOCUS ORIGIN VERSION );
>>
>> open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );
>>
>> (new GenBankParser)->parse_file( \*FH, sub {
>> my $parser = shift;
>>
>> print ">gi|", $parser->VERSION->gi,
>> "|gb|", $parser->VERSION->accession,
>> "." , $parser->VERSION->version,
>> "|" , $parser->LOCUS->locus,
>> " " , $parser->DEFINITION, "\n";
>>
>> my $seq = $parser->ORIGIN;
>> pos( $seq ) = 0;
>>
>> print $1, "\n" while $seq =~ m/\G (.{1,60}) /gx;
>> } );
>>
>>
>> ########################################
>> Time:
>>
>> real 1m35.797s
>> user 1m13.600s
>> sys 0m6.350s
>>
>>
>> Next program, dump out the locus identifer, accession number, molecular
>> type,
>> gi, and definition line.
>>
>> ########################################
>> Bioperl code:
>>
>> #!/usr/local/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> $in = Bio::SeqIO->new(
>> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
>> '-format' => 'Genbank' );
>>
>>
>> while (my $seq = $in->next_seq( )) {
>>
>> print $seq->display_id( ), "\n";
>> print $seq->accession_number( ), "\n";
>> print $seq->alphabet( ), "\n";
>> print $seq->primary_id( ), "\n";
>> print $seq->desc( ), "\n";
>> }
>>
>> ########################################
>> Time:
>>
>> real 16m29.771s
>> user 14m43.560s
>> sys 0m18.280s
>>
>>
>> ########################################
>> GenBankParser code:
>>
>> #!/usr/local/bin/perl -w
>>
>> use GenBankParser qw( DEFINITION LOCUS VERSION );
>>
>> open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );
>>
>> (new GenBankParser)->parse_file( \*FH, sub {
>> my $parser = shift;
>>
>> print $parser->LOCUS->locus, "\n";
>> print $parser->VERSION->accession, "\n";
>> print $parser->LOCUS->type, "\n";
>> print $parser->VERSION->gi, "\n";
>> print $parser->DEFINITION, "\n";
>> } );
>>
>>
>> ########################################
>> Time:
>>
>> real 0m51.068s
>> user 0m40.300s
>> sys 0m3.580s
>>
>>
>> Last program, print out gt2fasta format. That is protien in fasta
>> format.
>> First is GenBankParser code:
>>
>> #!/usr/local/bin/perl -w
>>
>> use strict;
>> use GenBankParser qw( DEFINITION FEATURES SOURCE LOCUS );
>> use GenBankParser::FEATURES qw( CDS );
>>
>> open( FH, "gzip -cd /am/db/gb-ff/gbbct.seq.gz |");
>>
>> (new GenBankParser)->parse_file( \*FH, sub {
>> my $Parser = shift;
>>
>> my $cdscnt = 0;
>> for my $cds (@{ $Parser->FEATURES->CDS }) {
>>
>> $cdscnt++;
>>
>> #
>> # Skip pseudo genes
>> #
>> next if $cds->pseudo;
>>
>> my $definition =
>> $cds->product ||
>> $cds->gene ||
>> (($cds->note) ? join "; ",
>> $Parser->DEFINITION, $cds->note :
>> $Parser->DEFINITION );
>>
>> my $organism = $Parser->SOURCE->organism;
>>
>> print ">gi|", $cds->gi,
>> "|gp|", $cds->accession,
>> "." , $cds->version,
>> "|" , $Parser->LOCUS->locus,
>> "_" , $cdscnt,
>> " " , $definition,
>> " [" , $organism,
>> "]\n";
>>
>> my $seq = $cds->translation;
>> pos( $seq ) = 0;
>>
>> print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
>> }
>> });
>>
>>
>> ########################################
>> Time:
>>
>> real 3m58.278s
>> user 2m54.370s
>> sys 0m5.720s
>>
>> And BioPerl code. I just learned the bioperl api so I may not be doing
>> things
>> the most efficient way. If so please tell me an I'll rewrite and rerun
>> my
>> tests.
>>
>> #!/usr/local/bin/perl -w
>>
>> use Bio::SeqIO;
>> use Bio::SeqFeature::Generic;
>>
>> $in = Bio::SeqIO->new(
>> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
>> '-format' => 'Genbank' );
>>
>> while (my $seq = $in->next_seq( )) {
>>
>> my $cdscnt = 0;
>>
>> foreach my $feature ( $seq->top_SeqFeatures ) {
>>
>> if ( $feature->primary_tag( ) eq 'CDS' ) {
>>
>> $cdscnt++;
>>
>> next if $feature->has_tag( 'pseudo' );
>>
>> my $definition = undef;
>>
>> if ( $feature->has_tag( 'product' ) ) {
>> $definition = ($feature->each_tag_value( 'product' ))[0];
>> }
>> elsif ( $feature->has_tag( 'gene' ) ) {
>> $definition = ($feature->each_tag_value( 'gene' ))[0];
>> }
>> elsif ( $feature->has_tag( 'note' ) ) {
>> $definition = join "; ",
>> $seq->desc,
>> ($feature->each_tag_value( 'note' ))[0];
>> }
>> else {
>> $definition = $seq->desc;
>> }
>>
>> my $organism = $seq->species->species;
>>
>> my $accession = undef;
>> my $version = undef;
>> my $gi = undef;
>>
>> if ( $feature->has_tag( 'db_xref' ) ) {
>> foreach my $tag ( $feature->each_tag_value( 'db_xref' ) ) {
>> $gi = $1, last if $tag =~ /^GI:(\d+)$/;
>> }
>> }
>>
>> if ( $feature->has_tag( 'protein_id' ) ) {
>> ($feature->each_tag_value( 'protein_id' ))[0] =~
>> /^([^.]+)\.(\d+)$/;
>>
>> $accession = $1;
>> $version = $2;
>> }
>>
>> print ">gi|", $gi,
>> "|gp|", $accession,
>> "." , $version,
>> "|" , $seq->display_id,
>> "_" , $cdscnt,
>> " " , $definition,
>> " [" , $organism,
>> "]\n";
>>
>> my $seq = ($feature->each_tag_value( 'translation' ))[0];
>> pos ( $seq ) = 0;
>>
>> print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
>> }
>> }
>> }
>>
>> ########################################
>> Time:
>>
>> real 17m17.897s
>> user 15m23.400s
>> sys 0m19.490s
>>
>>
>> So in all cases the bioperl genbank parsers was 5 to 17 times as slow as
>> the
>> GenBankParser. I think that's pretty dramatic.
>
> I'm the first to admit that the parser system needs an overhaul.
> In work that Ewan and myself did to benchmark things we know that all of
> our overhead that we've seen is in object creation not in the parsing
> itsself. But that is always to be investigated further if someone gets
> excited about this type of thing.
>
>>
>> I also feel that the GenBankParser interface is a little more intuitive--
>> mirroring almost exactly the format of the actual flat file. Of course,
>> I'm
>
> Right but we've designed objects to be datasource independent. It is just
> a different preference - Lincoln's BoulderIO essentially made objects
> that were very tied to the data source This works for a certain set of
> problems, but we're probably trying to solve slightly more general
> problems with different data sources. The idea that we want the
> same interfaces to the data regardless which format it is coming from.
> Anyways I think this is good stuff - I've not look at the code, but I
> think we do need someone to take an interest in seeing how to reduce
> overhead of object creation if we want our parser to be faster.
>
>
>
>> biased in this regard, being the author and all :).
>>
>> John Kloss.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l@bioperl.org
>> http://bioperl.org/mailman/listinfo/bioperl-l
>>
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>