[Bioperl-l] GenBank Parser

Ewan Birney birney@ebi.ac.uk
Tue, 31 Dec 2002 09:35:20 +0000 (GMT)


On Mon, 30 Dec 2002, Drew Stewart wrote:

> Hi Everybody,
> I have just joined the mailing list and kind of new to
> Bioperl.

The following script will work with bioperl 1.0 onwards (available at
http://www.bioperl.org and CPAN)

use Bio::SeqIO;

# this is the input stream - genbank
$seqin = Bio::SeqIO->new( -format => 'genbank', -file => 'myfiletoread.gb');


# this loop loops over all sequenecs
while( (my $seq = $seqin->next_seq()) ) {
    foreach my $sf ( $seq->get_SeqFeatures() ) {
        if( $sf->primary_tag eq 'CDS' ) {
        print "Coding sequence for ",$seq->display_id,
         " [",$seq->accession_number,"] extent is from ",
        $sf->start," to ",$sf->end,"\n";

        # test whether this location has a join
        if( $sf->location->isa("Bio::Location::SplitLocationI") ) {
            print " Built from...\n";
            foreach my $sub ( $sf->location->sub_Location ) {
             print "  Sub location ",$sub->start," ",$sub->end,
             " strand: ",$sub->strand,"\n";
         }
        }
    }
  }
}



For me, this prints out lines like:


Coding sequence for U58726 [U58726] extent is from 28176 to 31889
 Built from...
  Sub location 28176 28361 strand: -1
  Sub location 28501 28576 strand: -1
  Sub location 28625 28753 strand: -1
  Sub location 28811 28871 strand: -1
  Sub location 28916 29030 strand: -1
  Sub location 29154 29249 strand: -1
  Sub location 29301 29473 strand: -1
  Sub location 29567 29655 strand: -1
  Sub location 29698 29855 strand: -1
  Sub location 29903 30051 strand: -1
  Sub location 30149 30309 strand: -1
  Sub location 30526 30637 strand: -1
  Sub location 30690 31052 strand: -1
  Sub location 31100 31355 strand: -1
  Sub location 31716 31889 strand: -1




The 1.2 release (due out today!) has one or two additional things to make
your life easier when you use this, but this will work from 1.0 onwards.





> Here is my problem,
> I am trying to write a parser for the GenBank file
> obtained from the NCBI website, specifically for the
> "CDS" feature in the GenBank file.
> I am trying to parse the range given in the "CDS"
> feature to get the nucleotide subsequence from the
> whole genome for a specific protein.
> For example here is a portion of the genbank file I am
> interested in parsing.
>
>   CDS   complement(join(12..78,54..1043))
>
> or
>
> CDS   join(complement(<1..799),complement(5080..5120))
>
> I want to parse  this and other possible formats
> "join(complement(<1..799),complement(5080..5120))"
>
> I have seen in the GenBank readme file that there are
> many other possible formats for this CDS feature line
> and so I was wondering if somebody has already written
> a parser for this.
>
> Can anyone please suggest some things I can use.
> It would be a great help. Thank you.
>
> Sincerely,
> Dhruv Bhatt.
>
> __________________________________________________
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>