[Bioperl-l] Another problem with a joined feature
Jason Stajich
jason at bioperl.org
Fri Apr 18 00:55:29 UTC 2014
The problem is with NCBI here -
if you write back out the sequence you see the 1st exon is the only
location for the feature.
CDS 226..611
/db_xref="GI:4028015"
/codon_start=1
/protein_id="AAC96101.1"
/gene="KCNQ3"
If you turn on verbose and use the tempfile method
my $gb_db = Bio::DB::GenBank->new( -db => 'genbank',
-retrievaltype => 'tempfile',
-verbose => 1,
-rettype => 'gbwithparts' );
you will see the raw data that is being passed into SeqIO only has the
single location.
Also you can see the raw file here:
GET '
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=text&rettype=gbwithparts&db=nucleotide&tool=bioperl&id=AF071478.1&usehistory=n
'
I am not sure right now how to tell NCBI to give you the remote features
locations in the file it downloads but that is the problem not with
spliced_seq.
Jason
Jason Stajich
jason at bioperl.org
http://bioperl.org/wiki/User:Jason
http://twitter.com/hyphaltip
On Thu, Apr 17, 2014 at 2:26 PM, Warren Gallin <wgallin at ualberta.ca> wrote:
> I have encountered another problem with several GenBank nucleotide
> sequence records.
>
> The exemplar is accession number AF071478.1 which is a sequence of a
> specific human exon, annotated with a CDS feature that describes the whole
> protein sequence of which this exon anodes only a part.
>
> The CDS feature is described as a join of pieces of sequence from this
> record and several other records. However, when I try to obtain the full
> css I am only getting the part of the sequence from the AF071478.1 record,
> not the others.
>
> The following is a minimal script that illustrates this behaviour.
>
> My understanding was that if a feature was described as a join of multiple
> nucleotide sequence records then the spliced sequence() method with a
> handle to a GenBank created using Bio::DB::GenBank should also collect and
> splice in the fragments of nucleotide sequence from other GenBank records
> to create the final spliced sequence. Could someone explain where my
> understanding is going awry?
>
> Once again, thanks for any help.
>
> Warren Gallin
>
> ______________________________________________________________________
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
> use DBI;
> use Bio::Seq;
> use Bio::DB::EUtilities;
> use Bio::SeqIO;
> use Bio::Seq;
> use Data::Printer;
> use Bio::DB::GenBank;
>
> #Test script for collecting full nucleic acid sequences from a GenBank
> nucleotide
> #sequence record that has a feature annotated as joins between parts of a
> number of
> #individual nucleotide records.
>
> my @coded_by = ();
> #Single example of a nucleotide sequence entry with a CDS feature that
> includes joining
> #with the sequences from several other records.
>
> my $na_acc = 'AF071478.1';
>
> #The corresponding protein record for the targetted CDS sequence
>
> my $hit_acc = 'AAC96101.1';
>
> #Create a database handle to GENBANK for retrieving coding sequences,
> using the gbwithparts
> #as return type to accomodate large records if they occur.
>
> my $gb_db = Bio::DB::GenBank->new( -db => 'genbank', -rettype =>
> 'gbwithparts' );
>
> #Retrieve the genbank record for the test entry
>
> my $temp_nuc_record = $gb_db->get_Seq_by_acc($na_acc);
>
> foreach my $feat ( $temp_nuc_record->top_SeqFeatures ) {
> if ( $feat->primary_tag eq 'CDS' ) {
> eval {
> @coded_by = $feat->each_tag_value('protein_id');
> };
> if ($@) {
> print
> "Could not find any protein_id tags for $hit_acc nucleotide $na_acc.\n
> $@\n";
> exit;
> }
> if ( $coded_by[0] eq $hit_acc ) {
> my $na_seq = $feat->spliced_seq( -db => $gb_db )->seq;
> print "$na_seq\n";
>
> }
> }
> }
> exit;
>
>
> ______________________________________________________________________________
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
More information about the Bioperl-l
mailing list