[Bioperl-l] question about Bio::DB::GenBank

Jason Stajich jason@chg.mc.duke.edu
Tue, 12 Jun 2001 14:05:49 -0400 (EDT)


Bob - 
[ this will be a longwinded answer in case someone else wants to jump in
and fight the good fight with me wrt fixing genbank retrieval]

In the future, to be able to accurately reproduce your reported bug we
typically need:

a) when does 'all of the sudden' mean, yesterday vs today?
b) what version of bioperl you are using
c) what is the output of
% perldoc -m Bio::SeqIO::genbank | grep '$Id' 

Feel free to submit things like this to the bug tracking interface at 
http://www.bioperl.org/Bugs/

Now, all that said, let's see about your problem, which is not due to a
specific module version as I soon found out.

Hmm this is a complete genome, needless to say seqs this big probably
never made it into our test cases.

Since this is like a refseq sequence, just references to other
sequences, this is not actually genbank format (at least what bioperl
expects). So that CONTIG part is what is tripping us the genbank parser.

However, if we were smarter about parsing, $seq->seq() wouldn't return
anything if you were to run this on the file stored locally because there
are only sequence id references, not actual sequence data in this report.

Now there is a workaround, you can fetch entire seq as a fasta.  You have
to forgo having annotation though.

$gb = new Bio::DB::GenBank(-format => 'fasta');

However this too does not work either, and it does work in the tests for
smaller files, so perhaps since this is a rather large sequence something
is falling out at ncbi.   

We've basically reverse engineered their web scripts as best as possible
to allow us to write nice little objects for retrieval, but since we don't
have access to any one help.

Now we should be able to use the EMBL server which provides a very easy to
query interface based on accession number.  For your purposes this
requires just using a different DB module (isn't OO programming nice...)

#!/usr/local/bin/perl
use Bio::DB::EMBL;
$gb = new Bio::DB::EMBL();
my($id) = "AE004439";
$seq = $gb->get_Seq_by_id($id);
print $seq->seq ."\n";

This unfortunately only returns a sequence 12376 bases long, while the
genbank record appears to be 2257487 bases long.  Hmm, I'm stuck and can't
dive in more deeply at this point.

This appears to be a combination of NCBI magic and some incompatibilities.
I guess this should be submitted into the bugtracking system soon.

-jason

On Tue, 12 Jun 2001, Bob Mangold wrote:

> I'm farily new to using bioperl, so maybe I'm missing something. I've been
> succesfully using the Genbank tools to fetch data, but all of a sudden it won't
> work. It appears that complete genome files in GenBank often contain a contig
> of other files as opposed to the entire sequence (for example see AE004439).
> Maybe I'm worng but it appears that Bio::DB::GenBank can't deal with this. 
> 
> Can someone confirm this and maybe point me to a work around.
> 
> Here is my code:
> 
> #!/usr/local/bin/perl
> use Bio::DB::GenBank;
> $gb = new Bio::DB::GenBank();
> my($id) = "AE004439";
> $seq = $gb->get_Seq_by_id($id);
> print $seq->seq ."\n";
> 
> The result is:
> 
> Can't call method "_generic_seqfeature" on an undefined value at
> /usr/local/lib/perl5/site_perl/5.6.0/Bio/SeqIO/genbank.pm line 272.
> 
> -Bob Mangold
> 
> __________________________________________________
> Do You Yahoo!?
> Get personalized email addresses from Yahoo! Mail - only $35 
> a year!  http://personal.mail.yahoo.com/
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
> 

Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center 
http://www.chg.duke.edu/