[Bioperl-l] Another problem with a joined feature
Fields, Christopher J
cjfields at illinois.edu
Fri Apr 18 14:39:15 UTC 2014
Warren,
It’s a good idea to point out records like this to the folks at NCBI, mainly b/c they’re not technically correct: the sequence in the file does not come from a protein translation from the record and does not contain a remote location to complete the translation shown.
chris
On Apr 17, 2014, at 7:55 PM, Jason Stajich <jason at bioperl.org> wrote:
> The problem is with NCBI here -
>
> if you write back out the sequence you see the 1st exon is the only
> location for the feature.
>
> CDS 226..611
> /db_xref="GI:4028015"
> /codon_start=1
> /protein_id="AAC96101.1"
> /gene="KCNQ3"
>
> If you turn on verbose and use the tempfile method
>
> my $gb_db = Bio::DB::GenBank->new( -db => 'genbank',
> -retrievaltype => 'tempfile',
> -verbose => 1,
> -rettype => 'gbwithparts' );
>
> you will see the raw data that is being passed into SeqIO only has the
> single location.
>
> Also you can see the raw file here:
> GET '
> http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=text&rettype=gbwithparts&db=nucleotide&tool=bioperl&id=AF071478.1&usehistory=n
> '
>
> I am not sure right now how to tell NCBI to give you the remote features
> locations in the file it downloads but that is the problem not with
> spliced_seq.
>
> Jason
>
>
>
> Jason Stajich
> jason at bioperl.org
> http://bioperl.org/wiki/User:Jason
> http://twitter.com/hyphaltip
>
>
> On Thu, Apr 17, 2014 at 2:26 PM, Warren Gallin <wgallin at ualberta.ca> wrote:
>
>> I have encountered another problem with several GenBank nucleotide
>> sequence records.
>>
>> The exemplar is accession number AF071478.1 which is a sequence of a
>> specific human exon, annotated with a CDS feature that describes the whole
>> protein sequence of which this exon anodes only a part.
>>
>> The CDS feature is described as a join of pieces of sequence from this
>> record and several other records. However, when I try to obtain the full
>> css I am only getting the part of the sequence from the AF071478.1 record,
>> not the others.
>>
>> The following is a minimal script that illustrates this behaviour.
>>
>> My understanding was that if a feature was described as a join of multiple
>> nucleotide sequence records then the spliced sequence() method with a
>> handle to a GenBank created using Bio::DB::GenBank should also collect and
>> splice in the fragments of nucleotide sequence from other GenBank records
>> to create the final spliced sequence. Could someone explain where my
>> understanding is going awry?
>>
>> Once again, thanks for any help.
>>
>> Warren Gallin
>>
>> ______________________________________________________________________
>>
>> #!/usr/bin/perl
>>
>> use strict;
>> use warnings;
>> use DBI;
>> use Bio::Seq;
>> use Bio::DB::EUtilities;
>> use Bio::SeqIO;
>> use Bio::Seq;
>> use Data::Printer;
>> use Bio::DB::GenBank;
>>
>> #Test script for collecting full nucleic acid sequences from a GenBank
>> nucleotide
>> #sequence record that has a feature annotated as joins between parts of a
>> number of
>> #individual nucleotide records.
>>
>> my @coded_by = ();
>> #Single example of a nucleotide sequence entry with a CDS feature that
>> includes joining
>> #with the sequences from several other records.
>>
>> my $na_acc = 'AF071478.1';
>>
>> #The corresponding protein record for the targetted CDS sequence
>>
>> my $hit_acc = 'AAC96101.1';
>>
>> #Create a database handle to GENBANK for retrieving coding sequences,
>> using the gbwithparts
>> #as return type to accomodate large records if they occur.
>>
>> my $gb_db = Bio::DB::GenBank->new( -db => 'genbank', -rettype =>
>> 'gbwithparts' );
>>
>> #Retrieve the genbank record for the test entry
>>
>> my $temp_nuc_record = $gb_db->get_Seq_by_acc($na_acc);
>>
>> foreach my $feat ( $temp_nuc_record->top_SeqFeatures ) {
>> if ( $feat->primary_tag eq 'CDS' ) {
>> eval {
>> @coded_by = $feat->each_tag_value('protein_id');
>> };
>> if ($@) {
>> print
>> "Could not find any protein_id tags for $hit_acc nucleotide $na_acc.\n
>> $@\n";
>> exit;
>> }
>> if ( $coded_by[0] eq $hit_acc ) {
>> my $na_seq = $feat->spliced_seq( -db => $gb_db )->seq;
>> print "$na_seq\n";
>>
>> }
>> }
>> }
>> exit;
>>
>>
>> ______________________________________________________________________________
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list