RefSeq retrieval Re: [Bioperl-l] Help on retrieving NT contigs with Bio::DB::GenBank.

Heikki Lehvaslaiho heikki@ebi.ac.uk
Tue, 18 Dec 2001 10:05:13 +0000


Jason Eric Stajich wrote:
> 
> Kun-
> 
> This has to do with NCBI retrieval of RefSeq contigs (NT_*). NCBI does not
> implement this in the same way as for 'regular' accession numbers.
> 
> This is discussed in previous messages in the list and should make its way
> into a wiki FAQ at some point if someone wants to jump on it.
> 
> We're currently rethinking how to best provide this functionality as we
> are essentially limited by NCBI not providing a single CGI-BIN which maps
> to our simple Bio::DB::RandomAccessI interface for all valid accession
> numbers.  Ideas and volunteers to take this on welcomed.  Mathew Wiepert
> at Mayo started to look at it.  I suspect he got to the same point I did
> which meant parsing HTML and doing a 2-step retrieval method, at which
> point I balked.

The other way of approaching this problem is to serve the data from somewhere
else, without the complications of HTML. That is what I wrote the dbfetch
script for. Rodrigo Lopez added RefSeq to the EBI SRS server yesterday, and I
modified the local dbfetch script to include it.

I've just committed Bio::DB::RefSeq. At the same time I created
Bio::DB::DBFetch, which implements all the code for Bio::DB::WebDBSeqI and
acts as a superclass to both the EMBL and RefSeq modules. Adding a new
database to the system is now really simple: one just sets the default values.
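
As a rough sketch of how the new module might be used, assuming it exposes
the same get_Seq_by_acc call as the other Bio::DB::WebDBSeqI modules: the
helper name fetch_refseq is made up for illustration, and whether a given
accession can be fetched depends on what the EBI copy of RefSeq holds.

```perl
use strict;
use warnings;

# fetch_refseq is a hypothetical helper name, not part of BioPerl.
# It retrieves one entry through Bio::DB::RefSeq (EBI dbfetch) rather
# than through the NCBI CGI used by Bio::DB::GenBank.
sub fetch_refseq {
    my ($acc) = @_;
    require Bio::DB::RefSeq;            # loaded only when actually needed
    my $db = Bio::DB::RefSeq->new();
    return $db->get_Seq_by_acc($acc);   # returns a Bio::Seq object
}

# Command-line use: perl fetch_refseq.pl ACCESSION
if (@ARGV) {
    my $seq = fetch_refseq($ARGV[0]);
    printf "%s: %d residues\n", $seq->display_id, $seq->length;
}
```

The calling code is the same as for Bio::DB::GenBank or Bio::DB::EMBL, so
switching an existing script over should only mean changing the class name.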

There are two caveats here:

1. The EBI copy of RefSeq is not yet automatically updated. This will be set
up after Christmas.

2. Reading RefSeq entries in depends on the GenBank parser. The RefSeq format
closely mimics GenBank but differs from it in multiple minor ways in various
entries. As far as I know there is no documentation for the format, so there
is not much we can do about this except hope for the best when retrieving
individual entries.
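
Given the second caveat, it may be worth trapping exceptions per entry when
fetching several accessions, so that one entry the parser chokes on does not
stop the whole run. A minimal sketch, assuming the parser reports problems by
throwing (dying); fetch_tolerantly is a made-up name, and $db can be any
object with a get_Seq_by_acc method:

```perl
use strict;
use warnings;

# fetch_tolerantly is a hypothetical helper, not part of BioPerl.
# It fetches each accession in turn and records failures instead of dying.
# Returns two hash references: accession => sequence, accession => error.
sub fetch_tolerantly {
    my ($db, @accessions) = @_;
    my (%seqs, %errors);
    for my $acc (@accessions) {
        # eval traps any exception thrown during retrieval or parsing
        my $seq = eval { $db->get_Seq_by_acc($acc) };
        if (defined $seq) {
            $seqs{$acc} = $seq;
        }
        else {
            # $@ is empty when the call simply returned undef
            $errors{$acc} = $@ || 'no sequence returned';
        }
    }
    return (\%seqs, \%errors);
}
```

It would be called along the lines of
my ($ok, $failed) = fetch_tolerantly(Bio::DB::RefSeq->new, @accessions);
after which the keys of %$failed list the entries that need a closer look.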


Have fun,

	-Heikki

> -jason
> 
> On Fri, 30 Nov 2001, Kun Zhang wrote:
> 
> > Hello!
> >
> > I got an error message (attached below) when trying to retrieve some NT
> > contigs from GenBank with the Bio::DB::GenBank module. It looks like the
> > problem occurs only with NT sequences, because the getGenBank.pl that came
> > with the bioperl-0.9.0 distribution works fine. And my perl script works
> > when I replace "NT_001035" with "AF303112". Can anyone help me out? Thanks!
> >
> > Kun Zhang
> > Human Genetics Center
> > University of Texas-Houston
> >
> > ------------------------My codes-----------------------------
> > use Bio::DB::GenBank;
> >
> > my $gb = Bio::DB::GenBank->new;
> > $gb->request_format('fasta');
> > my $contigSeq = $gb->get_Seq_by_acc('NT_001035');
> >
> >
> >
> > ==============================ERROR MESSAGE==================================
> > -------------------- EXCEPTION --------------------
> > MSG: Attempting to set the sequence to [<HTML] which does not look healthy
> > STACK Bio::PrimarySeq::seq
> > /usr/local/lib/perl5/site_perl/5.6.1/Bio/PrimarySeq.pm:251
> > STACK Bio::PrimarySeq::new
> > /usr/local/lib/perl5/site_perl/5.6.1/Bio/PrimarySeq.pm:226
> > STACK Bio::Seq::new /usr/local/lib/perl5/site_perl/5.6.1/Bio/Seq.pm:132
> > STACK Bio::SeqIO::fasta::next_primary_seq
> > /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/fasta.pm:130
> > STACK Bio::SeqIO::fasta::next_seq
> > /usr/local/lib/perl5/site_perl/5.6.1/Bio/SeqIO/fasta.pm:85
> > STACK Bio::DB::WebDBSeqI::get_Seq_by_acc
> > /usr/local/lib/perl5/site_perl/5.6.1/Bio/DB/WebDBSeqI.pm:159
> > STACK toplevel ./splitSeq.pl:26
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
> 
> --
> Jason Stajich
> Duke University
> jason@cgt.mc.duke.edu

-- 
______ _/      _/_____________________________________________________
      _/      _/                      http://www.ebi.ac.uk/mutations/
     _/  _/  _/  Heikki Lehvaslaiho          heikki@ebi.ac.uk
    _/_/_/_/_/  EMBL Outstation, European Bioinformatics Institute
   _/  _/  _/  Wellcome Trust Genome Campus, Hinxton
  _/  _/  _/  Cambs. CB10 1SD, United Kingdom
     _/      Phone: +44 (0)1223 494 644   FAX: +44 (0)1223 494 468
___ _/_/_/_/_/________________________________________________________