Bioperl: Extracting genes from GenBank flat file

Wed, 14 Jun 2000 14:43:08 +0300

This is a multi-part message in MIME format.
--------------40D3893288EF2B4781A308AD
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Sorry for joining this thread so late, but it was just now that I
noticed few things about using Bio::DB::GenBank module. In it's present
form (in bioperl-0.6.1 and live) this module fetches fasta-formatted
sequence. That's the reason why features/annotations are not there.
However, for getting the thing work properly one has to do just 4 small
changes:

on line 115 change 'dopt=f' to 'dopt=g'
on line 223 change 'FORMAT=1' to 'FORMAT=6'
on line 246 change 'Fasta' to 'genbank'
on line 277 change 'Fasta' to 'genbank'

I include the the modified version of GenBank.pm at as an attachment
(don't know if this is PC these days ;-)). Maybe someone with cvs
writing rights can put it to the appropriate location.

imre

--------------------------------------------
- Imre Vastrik, Ph.D., tel:+359-9-19126886 -
-    http://www.hi.helsinki.fi/bioinfo     -
--------------------------------------------

gert thijs wrote:
> 
> I am trying to develop an some perl script to extract the location of CDS or
> genes in a DNA sequence based on the features described in the genbank flat
> file.
> I have two question:
> - Has anyone ever tried such things before? I haven't found anything browsing
> through the mail archive.
> - I use Bio::DB::GenBank to download a sequence from genbank, now I was
> wondering if the features in the genbank description are also downloaded and
> stored somewhere. I thought I could use the annotation object to extract this
> information but this seems not to be the case.
> I use something like this:
> $seq = $gb->get_Seq_by_acc($Accn[$i]);
> $ann = $seq->annotation();
> print $ann->description();
> 
> But then $ann->description() seems to be an empty string.
--------------40D3893288EF2B4781A308AD
Content-Type: text/plain; charset=us-ascii;
 name="GenBank.pm"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filename="GenBank.pm"

# test
#
# BioPerl module for Bio::DB::GenBank
#
# Cared for by Aaron Mackey <amackey@virginia.edu>
#
# Copyright Aaron Mackey
#
# You may distribute this module under the same terms as perl itself

# POD documentation - main docs before the code

=head1 NAME

Bio::DB::GenBank - Database object interface to GenBank

=head1 SYNOPSIS

    $gb = new Bio::DB::GenBank;

    $seq = $gb->get_Seq_by_id('MUSIGHBA1'); # Unique ID

    # or ...

    $seq = $gb->get_Seq_by_acc('J00522'); # Accession Number

=head1 DESCRIPTION

Allows the dynamic retrieval of Sequence objects (Bio::Seq) from the GenBank
database at NCBI, via an Entrez query.

WARNING: Please do NOT spam the Entrez web server with multiple requests.
NCBI offers Batch Entrez for this purpose.  Batch Entrez support will likely
be supported in a future version of DB::GenBank.

=head1 FEEDBACK

=head2 Mailing Lists

User feedback is an integral part of the
evolution of this and other Bioperl modules. Send
your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation
is much appreciated.

  vsns-bcd-perl@lists.uni-bielefeld.de          - General discussion
  vsns-bcd-perl-guts@lists.uni-bielefeld.de     - Technically-oriented discussion
  http://bio.perl.org/MailList.html             - About the mailing lists

=head2 Reporting Bugs

Report bugs to the Bioperl bug tracking system to
help us keep track the bugs and their resolution.
Bug reports can be submitted via email or the
web:

  bioperl-bugs@bio.perl.org
  http://bio.perl.org/bioperl-bugs/

=head1 AUTHOR - Aaron Mackey

Email amackey@virginia.edu

=head1 APPENDIX

The rest of the documentation details each of the
object methods. Internal methods are usually
preceded with a _

=cut

# Let the code begin...

package Bio::DB::GenBank;
use strict;
use vars qw(@ISA);

# Object preamble - inherits from Bio::DB::RandomAccessI

use Bio::DB::RandomAccessI;
use Bio::SeqIO;
use IO::Socket;
use IO::File;

@ISA = qw(Bio::Root::Object Bio::DB::RandomAccessI);

# new() is inherited from Bio::Root::Object

# _initialize is where the heavy stuff will happen when new is called

sub _initialize {
  my($self,@args) = @_;

  my $make = $self->SUPER::_initialize;

# set stuff in self from @args  
  return $make; # success - we hope!
}

=head2 get_Seq_by_id

 Title   : get_Seq_by_id
 Usage   : $seq = $db->get_Seq_by_id($uid);
 Function: Gets a Bio::Seq object by its unique identifier/name
 Returns : a Bio::Seq object
 Args    : $uid : the id (as a string) of the desired sequence entry

=cut

sub get_Seq_by_id {

  my $self = shift;
  my $uid = shift or $self->throw("Must supply an identifier!\n");

  my $entrez = "db=n&form=6&dopt=g&html=no&title=no&uid=$uid";

  my $stream = $self->_get_stream($entrez);
  my $seq = $stream->next_seq();
  $self->throw("Unable to get seq for id $uid, is it really a genbank id?\n") 
      if ( !defined $seq );
  return $seq;
}

=head2 get_Seq_by_acc

  Title   : get_Seq_by_acc
  Usage   : $seq = $db->get_Seq_by_acc($acc);
  Function: Gets a Bio::Seq object by its accession number
  Returns : a Bio::Seq object
  Args    : $acc : the accession number of the desired sequence entry
  Note    : For GenBank, this just calls the same code for get_Seq_by_id()

=cut

sub get_Seq_by_acc {

  my $self = shift;
  my $acc = shift or $self->throw("Must supply an accesion number!\n");

  return $self->get_Seq_by_id($acc);
}

=head2 get_Stream_by_id

  Title   : get_Stream_by_id
  Usage   : $stream = $db->get_Stream_by_id( [$uid1, $uid2] );
  Function: Gets a series of Seq objects by unique identifiers
  Returns : a Bio::SeqIO stream object
  Args    : $ref : a reference to an array of unique identifiers for
                   the desired sequence entries

=cut

sub get_Stream_by_id {

  my $self = shift;
  my $id = shift or $self->throw("Must supply a unique identifier!\n");
  ref($id) eq "ARRAY" or $self->throw("Must supply an array ref!\n");

  my $uid = join(',', @{$id});
  my $entrez = "db=n&form=6&dopt=f&html=no&title=no&uid=$uid" ;

  return $self->_get_stream($entrez);

}

=head2 get_Stream_by_acc

  Title   : get_Stream_by_acc
  Usage   : $seq = $db->get_Seq_by_acc($acc);
  Function: Gets a series of Seq objects by accession numbers
  Returns : a Bio::SeqIO stream object
  Args    : $ref : a reference to an array of accession numbers for
                   the desired sequence entries
  Note    : For GenBank, this just calls the same code for get_Stream_by_id()

=cut

sub get_Stream_by_acc {

  my $self = shift;
  my $acc = shift or $self->throw("Must supply an accession number!\n");

  return $self->get_Seq_by_id($acc);
}

=head2 get_Stream_by_batch

  Title   : get_Stream_by_batch
  Usage   : $seq = $db->get_Stream_by_batch($ref);
  Function: Retrieves Seq objects from Entrez 'en masse', rather than one
            at a time.  For large numbers of sequences, this is far superior
            than get_Stream_by_[id/acc]().
  Example :
  Returns : a Bio::SeqIO stream object
  Args    : $ref : either an array reference, a filename, or a filehandle
            from which to get the list of unique id's/accession numbers.

=cut

sub get_Stream_by_batch {
   my $self = shift;
   my $ref = shift or $self->throw("Must supply an argument!\n");
   my $which = ref($ref);
   my $fh;
   my $filename;
   if ( $which eq 'ARRAY') { # $ref is an array reference
       $fh = new_tmpfile IO::File;
       for ( @{$ref} ) {
	   print $fh $_ . "\n";
       }
       seek $fh, 0, 0;
       $filename = "tempfile.txt";
   } elsif ( $which eq '') { # $ref is a filename
       $fh = new IO::File $ref, "r";
       $filename = $ref;
   } elsif ( $which eq 'GLOB' or $which eq 'IO::File') { # $ref is assumed to be a filehandle
       $fh = $ref;
       $filename = "tempfile.txt";
   }

   my $wwwbuf = "DB=n&REQUEST_TYPE=LIST_OF_GIS&FORMAT=6&HTML=FALSE&SAVETO=FALSE&NOHEADER=TRUE&UID=" . join(',', grep { chomp; } <$fh> );

   my $sock = $self->_get_sock();

   select $sock;
   print "POST /cgi-bin/Entrez/qserver.cgi HTTP/1.0\015\012";
   print "Host: www.ncbi.nlm.nih.gov\015\012";
   print "User-Agent: $0::Bio::DB::GenBank\015\012";
   print "Connection: Keep-Alive\015\012";
   print "Content-type: application/x-www-form-urlencoded\015\012";
   print "Content-length: " . length($wwwbuf) . "\015\012";
   print "\015\012";
   print $wwwbuf;

   while (<$sock>) {
       if ( m,^HTTP/\d+\.\d+\s+(\d+)[^\012]\012, ) {
	   my $code = $1;
	   return undef unless $code =~ /^2/;
       }
       $self->throw("Entrez Error - check query sequences!\n") if m/^ERROR/i;
       last if m/Batch Entrez results/;
   }

   return Bio::SeqIO->new('-fh' => $sock, '-format' => 'genbank');

}

sub _get_stream {

  my($self, $entrez) = @_;

# most of this socket stuff is borrowed heavily from LWP::Simple, by
# Gisle Aas and Martijn Koster.  They copyleft'ed it, but we should give
# them full credit for this little diddy.

  my $sock = $self->_get_sock();

  print $sock join("\015\012" =>
		   "GET /htbin-post/Entrez/query?$entrez HTTP/1.0",
		   "Host: www.ncbi.nlm.nih.gov",
		   "User-Agent: $0::Bio::DB::GenBank",
		   "", "");

  while(<$sock>) {
    if ( m,^HTTP/\d+\.\d+\s+(\d+)[^\012]\012, ) {
      my $code = $1;
      return undef unless $code =~ /^2/;
    }
    $self->throw("Entrez Error - check query sequences!\n") if m/^ERROR/i;
    last if m/^------/; # Kludgy, but it's how L. Stein does Boulder too
  }

  return Bio::SeqIO->new('-fh' => $sock, '-format' => 'genbank');

}

sub _get_sock {
    my $self = shift;
    my $sock = IO::Socket::INET->new(PeerAddr => 'www.ncbi.nlm.nih.gov',
				     PeerPort => 80,
				     Proto    => 'tcp',
				     Timeout  => 60
				     );
    unless ($sock) {
	$@ =~ s/^.*?: //;
	$self->throw("Can't connect to GenBank ($@)\n");
    }
    $sock->autoflush();		# just for safety's sake if they have old IO::Socket

    return $sock;
}

1;
__END__

--------------40D3893288EF2B4781A308AD--

=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org/
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
====================================================================