[Bioperl-l] GenBank Parser
Ewan Birney
birney@ebi.ac.uk
Tue, 31 Dec 2002 09:35:20 +0000 (GMT)
On Mon, 30 Dec 2002, Drew Stewart wrote:
> Hi Everybody,
> I have just joined the mailing list and kind of new to
> Bioperl.
The following script will work with bioperl 1.0 onwards (available at
http://www.bioperl.org and CPAN)
use Bio::SeqIO;
# this is the input stream - genbank
$seqin = Bio::SeqIO->new( -format => 'genbank', -file => 'myfiletoread.gb');
# this loop loops over all sequenecs
while( (my $seq = $seqin->next_seq()) ) {
foreach my $sf ( $seq->get_SeqFeatures() ) {
if( $sf->primary_tag eq 'CDS' ) {
print "Coding sequence for ",$seq->display_id,
" [",$seq->accession_number,"] extent is from ",
$sf->start," to ",$sf->end,"\n";
# test whether this location has a join
if( $sf->location->isa("Bio::Location::SplitLocationI") ) {
print " Built from...\n";
foreach my $sub ( $sf->location->sub_Location ) {
print " Sub location ",$sub->start," ",$sub->end,
" strand: ",$sub->strand,"\n";
}
}
}
}
}
For me, this prints out lines like:
Coding sequence for U58726 [U58726] extent is from 28176 to 31889
Built from...
Sub location 28176 28361 strand: -1
Sub location 28501 28576 strand: -1
Sub location 28625 28753 strand: -1
Sub location 28811 28871 strand: -1
Sub location 28916 29030 strand: -1
Sub location 29154 29249 strand: -1
Sub location 29301 29473 strand: -1
Sub location 29567 29655 strand: -1
Sub location 29698 29855 strand: -1
Sub location 29903 30051 strand: -1
Sub location 30149 30309 strand: -1
Sub location 30526 30637 strand: -1
Sub location 30690 31052 strand: -1
Sub location 31100 31355 strand: -1
Sub location 31716 31889 strand: -1
The 1.2 release (due out today!) has one or two additional things to make
your life easier when you use this, but this will work from 1.0 onwards.
> Here is my problem,
> I am trying to write a parser for the GenBank file
> obtained from the NCBI website, specifically for the
> "CDS" feature in the GenBank file.
> I am trying to parse the range given in the "CDS"
> feature to get the nucleotide subsequence from the
> whole genome for a specific protein.
> For example here is a portion of the genbank file I am
> interested in parsing.
>
> CDS complement(join(12..78,54..1043))
>
> or
>
> CDS join(complement(<1..799),complement(5080..5120))
>
> I want to parse this and other possible formats
> "join(complement(<1..799),complement(5080..5120))"
>
> I have seen in the GenBank readme file that there are
> many other possible formats for this CDS feature line
> and so I was wondering if somebody has already written
> a parser for this.
>
> Can anyone please suggest some things I can use.
> It would be a great help. Thank you.
>
> Sincerely,
> Dhruv Bhatt.
>
> __________________________________________________
> Do you Yahoo!?
> Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
> http://mailplus.yahoo.com
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>