Bioperl: Extracting genes from GenBank flat file
Imre Vastrik
imre.vastrik@helsinki.fi
Wed, 14 Jun 2000 14:43:08 +0300
This is a multi-part message in MIME format.
--------------40D3893288EF2B4781A308AD
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sorry for joining this thread so late, but it was just now that I
noticed few things about using Bio::DB::GenBank module. In it's present
form (in bioperl-0.6.1 and live) this module fetches fasta-formatted
sequence. That's the reason why features/annotations are not there.
However, for getting the thing work properly one has to do just 4 small
changes:
on line 115 change 'dopt=f' to 'dopt=g'
on line 223 change 'FORMAT=1' to 'FORMAT=6'
on line 246 change 'Fasta' to 'genbank'
on line 277 change 'Fasta' to 'genbank'
I include the the modified version of GenBank.pm at as an attachment
(don't know if this is PC these days ;-)). Maybe someone with cvs
writing rights can put it to the appropriate location.
imre
--------------------------------------------
- Imre Vastrik, Ph.D., tel:+359-9-19126886 -
- http://www.hi.helsinki.fi/bioinfo -
--------------------------------------------
gert thijs wrote:
>
> I am trying to develop an some perl script to extract the location of CDS or
> genes in a DNA sequence based on the features described in the genbank flat
> file.
> I have two question:
> - Has anyone ever tried such things before? I haven't found anything browsing
> through the mail archive.
> - I use Bio::DB::GenBank to download a sequence from genbank, now I was
> wondering if the features in the genbank description are also downloaded and
> stored somewhere. I thought I could use the annotation object to extract this
> information but this seems not to be the case.
> I use something like this:
> $seq = $gb->get_Seq_by_acc($Accn[$i]);
> $ann = $seq->annotation();
> print $ann->description();
>
> But then $ann->description() seems to be an empty string.
--------------40D3893288EF2B4781A308AD
Content-Type: text/plain; charset=us-ascii;
name="GenBank.pm"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
filename="GenBank.pm"
# test
#
# BioPerl module for Bio::DB::GenBank
#
# Cared for by Aaron Mackey <amackey@virginia.edu>
#
# Copyright Aaron Mackey
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::DB::GenBank - Database object interface to GenBank
=head1 SYNOPSIS
$gb = new Bio::DB::GenBank;
$seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID
# or ...
$seq = $gb->get_Seq_by_acc('J00522'); # Accession Number
=head1 DESCRIPTION
Allows the dynamic retrieval of Sequence objects (Bio::Seq) from the GenBank
database at NCBI, via an Entrez query.
WARNING: Please do NOT spam the Entrez web server with multiple requests.
NCBI offers Batch Entrez for this purpose. Batch Entrez support will likely
be supported in a future version of DB::GenBank.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the
evolution of this and other Bioperl modules. Send
your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation
is much appreciated.
vsns-bcd-perl@lists.uni-bielefeld.de - General discussion
vsns-bcd-perl-guts@lists.uni-bielefeld.de - Technically-oriented discussion
http://bio.perl.org/MailList.html - About the mailing lists
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to
help us keep track the bugs and their resolution.
Bug reports can be submitted via email or the
web:
bioperl-bugs@bio.perl.org
http://bio.perl.org/bioperl-bugs/
=head1 AUTHOR - Aaron Mackey
Email amackey@virginia.edu
=head1 APPENDIX
The rest of the documentation details each of the
object methods. Internal methods are usually
preceded with a _
=cut
# Let the code begin...
package Bio::DB::GenBank;
use strict;
use vars qw(@ISA);
# Object preamble - inherits from Bio::DB::RandomAccessI
use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use IO::Socket;
use IO::File;
@ISA = qw(Bio::Root::Object Bio::DB::RandomAccessI);
# new() is inherited from Bio::Root::Object
# _initialize is where the heavy stuff will happen when new is called
sub _initialize {
my($self,@args) = @_;
my $make = $self->SUPER::_initialize;
# set stuff in self from @args
return $make; # success - we hope!
}
=head2 get_Seq_by_id
Title : get_Seq_by_id
Usage : $seq = $db->get_Seq_by_id($uid);
Function: Gets a Bio::Seq object by its unique identifier/name
Returns : a Bio::Seq object
Args : $uid : the id (as a string) of the desired sequence entry
=cut
sub get_Seq_by_id {
my $self = shift;
my $uid = shift or $self->throw("Must supply an identifier!\n");
my $entrez = "db=n&form=6&dopt=g&html=no&title=no&uid=$uid";
my $stream = $self->_get_stream($entrez);
my $seq = $stream->next_seq();
$self->throw("Unable to get seq for id $uid, is it really a genbank id?\n")
if ( !defined $seq );
return $seq;
}
=head2 get_Seq_by_acc
Title : get_Seq_by_acc
Usage : $seq = $db->get_Seq_by_acc($acc);
Function: Gets a Bio::Seq object by its accession number
Returns : a Bio::Seq object
Args : $acc : the accession number of the desired sequence entry
Note : For GenBank, this just calls the same code for get_Seq_by_id()
=cut
sub get_Seq_by_acc {
my $self = shift;
my $acc = shift or $self->throw("Must supply an accesion number!\n");
return $self->get_Seq_by_id($acc);
}
=head2 get_Stream_by_id
Title : get_Stream_by_id
Usage : $stream = $db->get_Stream_by_id( [$uid1, $uid2] );
Function: Gets a series of Seq objects by unique identifiers
Returns : a Bio::SeqIO stream object
Args : $ref : a reference to an array of unique identifiers for
the desired sequence entries
=cut
sub get_Stream_by_id {
my $self = shift;
my $id = shift or $self->throw("Must supply a unique identifier!\n");
ref($id) eq "ARRAY" or $self->throw("Must supply an array ref!\n");
my $uid = join(',', @{$id});
my $entrez = "db=n&form=6&dopt=f&html=no&title=no&uid=$uid" ;
return $self->_get_stream($entrez);
}
=head2 get_Stream_by_acc
Title : get_Stream_by_acc
Usage : $seq = $db->get_Seq_by_acc($acc);
Function: Gets a series of Seq objects by accession numbers
Returns : a Bio::SeqIO stream object
Args : $ref : a reference to an array of accession numbers for
the desired sequence entries
Note : For GenBank, this just calls the same code for get_Stream_by_id()
=cut
sub get_Stream_by_acc {
my $self = shift;
my $acc = shift or $self->throw("Must supply an accession number!\n");
return $self->get_Seq_by_id($acc);
}
=head2 get_Stream_by_batch
Title : get_Stream_by_batch
Usage : $seq = $db->get_Stream_by_batch($ref);
Function: Retrieves Seq objects from Entrez 'en masse', rather than one
at a time. For large numbers of sequences, this is far superior
than get_Stream_by_[id/acc]().
Example :
Returns : a Bio::SeqIO stream object
Args : $ref : either an array reference, a filename, or a filehandle
from which to get the list of unique id's/accession numbers.
=cut
sub get_Stream_by_batch {
my $self = shift;
my $ref = shift or $self->throw("Must supply an argument!\n");
my $which = ref($ref);
my $fh;
my $filename;
if ( $which eq 'ARRAY') { # $ref is an array reference
$fh = new_tmpfile IO::File;
for ( @{$ref} ) {
print $fh $_ . "\n";
}
seek $fh, 0, 0;
$filename = "tempfile.txt";
} elsif ( $which eq '') { # $ref is a filename
$fh = new IO::File $ref, "r";
$filename = $ref;
} elsif ( $which eq 'GLOB' or $which eq 'IO::File') { # $ref is assumed to be a filehandle
$fh = $ref;
$filename = "tempfile.txt";
}
my $wwwbuf = "DB=n&REQUEST_TYPE=LIST_OF_GIS&FORMAT=6&HTML=FALSE&SAVETO=FALSE&NOHEADER=TRUE&UID=" . join(',', grep { chomp; } <$fh> );
my $sock = $self->_get_sock();
select $sock;
print "POST /cgi-bin/Entrez/qserver.cgi HTTP/1.0\015\012";
print "Host: www.ncbi.nlm.nih.gov\015\012";
print "User-Agent: $0::Bio::DB::GenBank\015\012";
print "Connection: Keep-Alive\015\012";
print "Content-type: application/x-www-form-urlencoded\015\012";
print "Content-length: " . length($wwwbuf) . "\015\012";
print "\015\012";
print $wwwbuf;
while (<$sock>) {
if ( m,^HTTP/\d+\.\d+\s+(\d+)[^\012]\012, ) {
my $code = $1;
return undef unless $code =~ /^2/;
}
$self->throw("Entrez Error - check query sequences!\n") if m/^ERROR/i;
last if m/Batch Entrez results/;
}
return Bio::SeqIO->new('-fh' => $sock, '-format' => 'genbank');
}
sub _get_stream {
my($self, $entrez) = @_;
# most of this socket stuff is borrowed heavily from LWP::Simple, by
# Gisle Aas and Martijn Koster. They copyleft'ed it, but we should give
# them full credit for this little diddy.
my $sock = $self->_get_sock();
print $sock join("\015\012" =>
"GET /htbin-post/Entrez/query?$entrez HTTP/1.0",
"Host: www.ncbi.nlm.nih.gov",
"User-Agent: $0::Bio::DB::GenBank",
"", "");
while(<$sock>) {
if ( m,^HTTP/\d+\.\d+\s+(\d+)[^\012]\012, ) {
my $code = $1;
return undef unless $code =~ /^2/;
}
$self->throw("Entrez Error - check query sequences!\n") if m/^ERROR/i;
last if m/^------/; # Kludgy, but it's how L. Stein does Boulder too
}
return Bio::SeqIO->new('-fh' => $sock, '-format' => 'genbank');
}
sub _get_sock {
my $self = shift;
my $sock = IO::Socket::INET->new(PeerAddr => 'www.ncbi.nlm.nih.gov',
PeerPort => 80,
Proto => 'tcp',
Timeout => 60
);
unless ($sock) {
$@ =~ s/^.*?: //;
$self->throw("Can't connect to GenBank ($@)\n");
}
$sock->autoflush(); # just for safety's sake if they have old IO::Socket
return $sock;
}
1;
__END__
--------------40D3893288EF2B4781A308AD--
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================