[Bioperl-l] Another problem with a joined feature

Thu Apr 17 21:26:42 UTC 2014

I have encountered another problem with several GenBank nucleotide sequence records.

The exemplar is accession number AF071478.1 which is a sequence of a specific human exon, annotated with a CDS feature that describes the whole protein sequence  of which this exon anodes only a part.

The CDS feature is described as a join of pieces of sequence from this record and several other records.  However, when I try to obtain the full css I am only getting the part of the sequence from the AF071478.1 record, not the others. 

The following is a minimal script that illustrates this behaviour.

My understanding was that if a feature was described as a join of multiple nucleotide sequence records then the spliced sequence() method with a handle to a GenBank created using Bio::DB::GenBank should also collect and splice in the fragments of nucleotide sequence from other GenBank records to create the final spliced sequence.  Could someone explain where my understanding is going awry?

Once again, thanks for any help.

Warren Gallin

______________________________________________________________________

#!/usr/bin/perl 

use strict;
use warnings;
use DBI;
use Bio::Seq;
use Bio::DB::EUtilities;
use Bio::SeqIO;
use Bio::Seq;
use Data::Printer;
use Bio::DB::GenBank;

#Test script for collecting full nucleic acid sequences from a GenBank nucleotide
#sequence record that has a feature annotated as joins between parts of a number of
#individual nucleotide records.

my @coded_by = ();
#Single example of a nucleotide sequence entry with a CDS feature that includes joining
#with the sequences from several other records.

my $na_acc = 'AF071478.1';

#The corresponding protein record for the targetted CDS sequence

my $hit_acc = 'AAC96101.1';

#Create a database handle to GENBANK for retrieving coding sequences, using the gbwithparts
#as return type to accomodate large records if they occur.

my $gb_db = Bio::DB::GenBank->new( -db => 'genbank', -rettype => 'gbwithparts' );

#Retrieve the genbank record for the test entry

my $temp_nuc_record = $gb_db->get_Seq_by_acc($na_acc);

foreach my $feat ( $temp_nuc_record->top_SeqFeatures ) {
    if ( $feat->primary_tag eq 'CDS' ) {
        eval {
        	@coded_by = $feat->each_tag_value('protein_id');
        };
        if ($@) {
            print
"Could not find any protein_id tags for $hit_acc nucleotide $na_acc.\n $@\n";
            exit;
        }
        if ( $coded_by[0] eq $hit_acc ) {
            my $na_seq = $feat->spliced_seq( -db => $gb_db )->seq;
            print "$na_seq\n";

        }
    }
}
exit;

______________________________________________________________________________