[Bioperl-l] Another problem with a joined feature

Fri Apr 18 14:39:15 UTC 2014

Warren,

It’s a good idea to point out records like this to the folks at NCBI, mainly b/c they’re not technically correct: the sequence in the file does not come from a protein translation from the record and does not contain a remote location to complete the translation shown.

chris

On Apr 17, 2014, at 7:55 PM, Jason Stajich <jason at bioperl.org> wrote:

> The problem is with NCBI here -
> 
> if you write back out the sequence you see the 1st exon is the only
> location for the feature.
> 
>   CDS             226..611
>                     /db_xref="GI:4028015"
>                     /codon_start=1
>                     /protein_id="AAC96101.1"
>                     /gene="KCNQ3"
> 
> If you turn on verbose and use the tempfile method
> 
> my $gb_db = Bio::DB::GenBank->new( -db => 'genbank',
>   -retrievaltype => 'tempfile',
>   -verbose => 1,
>   -rettype => 'gbwithparts' );
> 
> you will see the raw data that is being passed into SeqIO only has the
> single location.
> 
> Also you can see the raw file here:
> GET '
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=text&rettype=gbwithparts&db=nucleotide&tool=bioperl&id=AF071478.1&usehistory=n
> '
> 
> I am not sure right now how to tell NCBI to give you the remote features
> locations in the file it downloads but that is the problem not with
> spliced_seq.
> 
> Jason
> 
> 
> 
> Jason Stajich
> jason at bioperl.org
> http://bioperl.org/wiki/User:Jason
> http://twitter.com/hyphaltip
> 
> 
> On Thu, Apr 17, 2014 at 2:26 PM, Warren Gallin <wgallin at ualberta.ca> wrote:
> 
>> I have encountered another problem with several GenBank nucleotide
>> sequence records.
>> 
>> The exemplar is accession number AF071478.1 which is a sequence of a
>> specific human exon, annotated with a CDS feature that describes the whole
>> protein sequence  of which this exon anodes only a part.
>> 
>> The CDS feature is described as a join of pieces of sequence from this
>> record and several other records.  However, when I try to obtain the full
>> css I am only getting the part of the sequence from the AF071478.1 record,
>> not the others.
>> 
>> The following is a minimal script that illustrates this behaviour.
>> 
>> My understanding was that if a feature was described as a join of multiple
>> nucleotide sequence records then the spliced sequence() method with a
>> handle to a GenBank created using Bio::DB::GenBank should also collect and
>> splice in the fragments of nucleotide sequence from other GenBank records
>> to create the final spliced sequence.  Could someone explain where my
>> understanding is going awry?
>> 
>> Once again, thanks for any help.
>> 
>> Warren Gallin
>> 
>> ______________________________________________________________________
>> 
>> #!/usr/bin/perl
>> 
>> use strict;
>> use warnings;
>> use DBI;
>> use Bio::Seq;
>> use Bio::DB::EUtilities;
>> use Bio::SeqIO;
>> use Bio::Seq;
>> use Data::Printer;
>> use Bio::DB::GenBank;
>> 
>> #Test script for collecting full nucleic acid sequences from a GenBank
>> nucleotide
>> #sequence record that has a feature annotated as joins between parts of a
>> number of
>> #individual nucleotide records.
>> 
>> my @coded_by = ();
>> #Single example of a nucleotide sequence entry with a CDS feature that
>> includes joining
>> #with the sequences from several other records.
>> 
>> my $na_acc = 'AF071478.1';
>> 
>> #The corresponding protein record for the targetted CDS sequence
>> 
>> my $hit_acc = 'AAC96101.1';
>> 
>> #Create a database handle to GENBANK for retrieving coding sequences,
>> using the gbwithparts
>> #as return type to accomodate large records if they occur.
>> 
>> my $gb_db = Bio::DB::GenBank->new( -db => 'genbank', -rettype =>
>> 'gbwithparts' );
>> 
>> #Retrieve the genbank record for the test entry
>> 
>> my $temp_nuc_record = $gb_db->get_Seq_by_acc($na_acc);
>> 
>> foreach my $feat ( $temp_nuc_record->top_SeqFeatures ) {
>>    if ( $feat->primary_tag eq 'CDS' ) {
>>        eval {
>>                @coded_by = $feat->each_tag_value('protein_id');
>>        };
>>        if ($@) {
>>            print
>> "Could not find any protein_id tags for $hit_acc nucleotide $na_acc.\n
>> $@\n";
>>            exit;
>>        }
>>        if ( $coded_by[0] eq $hit_acc ) {
>>            my $na_seq = $feat->spliced_seq( -db => $gb_db )->seq;
>>            print "$na_seq\n";
>> 
>>        }
>>    }
>> }
>> exit;
>> 
>> 
>> ______________________________________________________________________________
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l