[Bioperl-l] GenBankParser comparison to bioperl parser
Jason Stajich
jason@cgt.mc.duke.edu
Wed, 11 Sep 2002 16:14:42 -0400 (EDT)
On Wed, 11 Sep 2002, John Kloss wrote:
> I was asked to provided some coarses tests demonstrating the features and speed of the GenBankParser in comparsion to the parser provided in bioperl. I just
> learned how to use the bioperl interface, so please let me know if I did
> anything stupid.
>
> In all tests I used the same flat file
>
> bct1.seq size gzipped 73679187 bytes.
>
> All tests run on a realtively loaded Sun Enterprise 4500 8 x 400Mhz sparc with
> 10Gbs ram.
>
> 1st test. Convert GenBank flat file to fasta.
>
> ########################################
> Bioperl parser code:
>
> #!/usr/local/bin/perl -w
>
> use Bio::SeqIO;
>
> $in = Bio::SeqIO->newFh(
> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
> '-format' => 'Genbank' );
>
> $out = Bio::SeqIO->newFh(
> '-file' => "> output1.txt",
> '-format' => 'fasta' );
>
> print $out $_ while <$in>;
>
> ########################################
> Time:
>
> real 17m4.008s
> user 15m19.500s
> sys 0m20.850s
>
>
> ########################################
> GenBankParser code:
>
> #!/usr/local/bin/perl -w
>
> use GenBankParser qw( DEFINITION LOCUS ORIGIN VERSION );
>
> open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );
>
> (new GenBankParser)->parse_file( \*FH, sub {
> my $parser = shift;
>
> print ">gi|", $parser->VERSION->gi,
> "|gb|", $parser->VERSION->accession,
> "." , $parser->VERSION->version,
> "|" , $parser->LOCUS->locus,
> " " , $parser->DEFINITION, "\n";
>
> my $seq = $parser->ORIGIN;
> pos( $seq ) = 0;
>
> print $1, "\n" while $seq =~ m/\G (.{1,60}) /gx;
> } );
>
>
> ########################################
> Time:
>
> real 1m35.797s
> user 1m13.600s
> sys 0m6.350s
>
>
> Next program, dump out the locus identifer, accession number, molecular type,
> gi, and definition line.
>
> ########################################
> Bioperl code:
>
> #!/usr/local/bin/perl -w
>
> use Bio::SeqIO;
>
> $in = Bio::SeqIO->new(
> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
> '-format' => 'Genbank' );
>
>
> while (my $seq = $in->next_seq( )) {
>
> print $seq->display_id( ), "\n";
> print $seq->accession_number( ), "\n";
> print $seq->alphabet( ), "\n";
> print $seq->primary_id( ), "\n";
> print $seq->desc( ), "\n";
> }
>
> ########################################
> Time:
>
> real 16m29.771s
> user 14m43.560s
> sys 0m18.280s
>
>
> ########################################
> GenBankParser code:
>
> #!/usr/local/bin/perl -w
>
> use GenBankParser qw( DEFINITION LOCUS VERSION );
>
> open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );
>
> (new GenBankParser)->parse_file( \*FH, sub {
> my $parser = shift;
>
> print $parser->LOCUS->locus, "\n";
> print $parser->VERSION->accession, "\n";
> print $parser->LOCUS->type, "\n";
> print $parser->VERSION->gi, "\n";
> print $parser->DEFINITION, "\n";
> } );
>
>
> ########################################
> Time:
>
> real 0m51.068s
> user 0m40.300s
> sys 0m3.580s
>
>
> Last program, print out gt2fasta format. That is protien in fasta format.
> First is GenBankParser code:
>
> #!/usr/local/bin/perl -w
>
> use strict;
> use GenBankParser qw( DEFINITION FEATURES SOURCE LOCUS );
> use GenBankParser::FEATURES qw( CDS );
>
> open( FH, "gzip -cd /am/db/gb-ff/gbbct.seq.gz |");
>
> (new GenBankParser)->parse_file( \*FH, sub {
> my $Parser = shift;
>
> my $cdscnt = 0;
> for my $cds (@{ $Parser->FEATURES->CDS }) {
>
> $cdscnt++;
>
> #
> # Skip pseudo genes
> #
> next if $cds->pseudo;
>
> my $definition =
> $cds->product ||
> $cds->gene ||
> (($cds->note) ? join "; ",
> $Parser->DEFINITION, $cds->note :
> $Parser->DEFINITION );
>
> my $organism = $Parser->SOURCE->organism;
>
> print ">gi|", $cds->gi,
> "|gp|", $cds->accession,
> "." , $cds->version,
> "|" , $Parser->LOCUS->locus,
> "_" , $cdscnt,
> " " , $definition,
> " [" , $organism,
> "]\n";
>
> my $seq = $cds->translation;
> pos( $seq ) = 0;
>
> print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
> }
> });
>
>
> ########################################
> Time:
>
> real 3m58.278s
> user 2m54.370s
> sys 0m5.720s
>
> And BioPerl code. I just learned the bioperl api so I may not be doing things
> the most efficient way. If so please tell me an I'll rewrite and rerun my
> tests.
>
> #!/usr/local/bin/perl -w
>
> use Bio::SeqIO;
> use Bio::SeqFeature::Generic;
>
> $in = Bio::SeqIO->new(
> '-file' => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
> '-format' => 'Genbank' );
>
> while (my $seq = $in->next_seq( )) {
>
> my $cdscnt = 0;
>
> foreach my $feature ( $seq->top_SeqFeatures ) {
>
> if ( $feature->primary_tag( ) eq 'CDS' ) {
>
> $cdscnt++;
>
> next if $feature->has_tag( 'pseudo' );
>
> my $definition = undef;
>
> if ( $feature->has_tag( 'product' ) ) {
> $definition = ($feature->each_tag_value( 'product' ))[0];
> }
> elsif ( $feature->has_tag( 'gene' ) ) {
> $definition = ($feature->each_tag_value( 'gene' ))[0];
> }
> elsif ( $feature->has_tag( 'note' ) ) {
> $definition = join "; ",
> $seq->desc,
> ($feature->each_tag_value( 'note' ))[0];
> }
> else {
> $definition = $seq->desc;
> }
>
> my $organism = $seq->species->species;
>
> my $accession = undef;
> my $version = undef;
> my $gi = undef;
>
> if ( $feature->has_tag( 'db_xref' ) ) {
> foreach my $tag ( $feature->each_tag_value( 'db_xref' ) ) {
> $gi = $1, last if $tag =~ /^GI:(\d+)$/;
> }
> }
>
> if ( $feature->has_tag( 'protein_id' ) ) {
> ($feature->each_tag_value( 'protein_id' ))[0] =~
> /^([^.]+)\.(\d+)$/;
>
> $accession = $1;
> $version = $2;
> }
>
> print ">gi|", $gi,
> "|gp|", $accession,
> "." , $version,
> "|" , $seq->display_id,
> "_" , $cdscnt,
> " " , $definition,
> " [" , $organism,
> "]\n";
>
> my $seq = ($feature->each_tag_value( 'translation' ))[0];
> pos ( $seq ) = 0;
>
> print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
> }
> }
> }
>
> ########################################
> Time:
>
> real 17m17.897s
> user 15m23.400s
> sys 0m19.490s
>
>
> So in all cases the bioperl genbank parsers was 5 to 17 times as slow as the
> GenBankParser. I think that's pretty dramatic.
I'm the first to admit that the parser system needs an overhaul.
In work that Ewan and myself did to benchmark things we know that all of
our overhead that we've seen is in object creation not in the parsing
itsself. But that is always to be investigated further if someone gets
excited about this type of thing.
>
> I also feel that the GenBankParser interface is a little more intuitive--
> mirroring almost exactly the format of the actual flat file. Of course, I'm
Right but we've designed objects to be datasource independent. It is just
a different preference - Lincoln's BoulderIO essentially made objects
that were very tied to the data source This works for a certain set of
problems, but we're probably trying to solve slightly more general
problems with different data sources. The idea that we want the
same interfaces to the data regardless which format it is coming from.
Anyways I think this is good stuff - I've not look at the code, but I
think we do need someone to take an interest in seeing how to reduce
overhead of object creation if we want our parser to be faster.
> biased in this regard, being the author and all :).
>
> John Kloss.
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>
--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu