[Bioperl-l] Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?

Thu Mar 1 15:51:38 UTC 2007

In my GenBank files when I'm sitting on a CDS usually I can just call

    $feature->seq->seq;

and out pops the exact nucleotide sequence which codes my protein. Very 
cool.

Unfortunately, I have a crazy GenBank file which contains a CDS with a 
split range like this:  CDS join(1959..2355,1..92)

When I try to use $feature->seq->seq I don't end up with just the properly 
pieced together coding region, I end up with the *entire* nucleotide 
sequence.

This seems to be happening because

   Bio::SeqFeature::Generic::seq
   506:  my $seq = $self->{'_gsf_seq'}->trunc($self->start(), $self->end());
   (which is calling Bio::PrimarySeqI::trunc)

works fine when Bio::SeqFeature::Generic is using

    '_location' => Bio::Location::Simple=HASH(0x1804344)
       '_end' => 2842
       '_location_type' => 'EXACT'
       '_root_verbose' => 0
       '_seqid' => 'AE015930'
       '_start' => 1601
       '_strand' => 1

but when things get complicated and Bio::SeqFeature::Generic is using

    '_location' => Bio::Location::Split=HASH(0x1d1f130)
       '_seqid' => 'PNECG'
       '_splittype' => 'JOIN'
       '_sublocations' => ARRAY(0x1d1e654)
          0  Bio::Location::Simple=HASH(0x1d1f290)
             '_end' => 2355
             '_location_type' => 'EXACT'
             '_root_verbose' => 0
             '_seqid' => 'PNECG'
             '_start' => 1959
             '_strand' => 1
          1  Bio::Location::Simple=HASH(0x1d1f338)
             '_end' => 92
             '_location_type' => 'EXACT'
             '_root_verbose' => 0
             '_seqid' => 'PNECG'
             '_start' => 1
             '_strand' => 1

Simply passing $self->start and $self->end into trunc() will not pull 
off the appropriate magic.

Question 1: Perhaps my data was bad and I should refuse to process 
join(1959..2355,1..92)? My accession is M12730, and if I download that 
from NCBI now it looks like they've changed it so my problem no longer 
exists in that sequence anyway. There are already 71 examples of CDS join 
in various files in t/data, and *none* of those examples jump backwards. 
Should I write this off as bad data or try to enhance BioPerl? I'm happy 
to throw my painful M12730 on the end of t/data/test.genbank and write 
tests for it if anyone thinks it is important.

Question 2: Even if we can just ignore my M12730, though, I think there's 
still a problem afoot. Below I demo L26462 (already siting in 
t/data/test.genbank) which has a

    CDS join(866..957,1088..1310,2161..2289)

In this case (as my tests below demonstrate), $feature->seq->seq is 
pulling the right range of nucleotide, but it's also pulling the gaps 
(introns). Isn't that wrong? Shouldn't it skip the introns?

So... is the appropriate approach to try to enhance 
Bio::PrimarySeqI::trunc() for Bio::Location::Split? Or should trunc() be 
left alone and Bio::SeqFeature::Generic::seq() needs to get smarter?

Or...?

Thanks, oh mighty BioWizards!  :)

j
seqlab.net
http://www.bioperl.org/wiki/User:Jhannah

-----------------
Tack this on the end of t/genbank.t and the length test at the end fails:
-----------------
# Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
my $stream = Bio::SeqIO->new(-file => Bio::Root::IO->catfile
                                       ("t","data","test.genbank"),
                                       -verbose => $verbose,
                              -format => 'genbank');
my $seq = $stream->next_seq;
while ($seq->accession ne "M37762") {
    $seq = $stream->next_seq;
}
# M37762 has a CDS 76..819, which should work fine.
ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
my $feat;
foreach my $feat2 ( @features ) {
    next unless ($feat2->primary_tag eq "CDS");
    my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
    if (grep { $_ eq "GI:179403" } @db_xrefs) {
       $feat = $feat2;
       last;
    }
}
my ($protein_seq) = $feat->annotation->get_Annotations("translation");
ok($protein_seq    =~ /^MTILFLTMVISYFGCMKA.*GWRFIRIDTSCVCTLTIKRGR$/,   "protein sequence");
my ($nucleotide_seq) = $feat->seq->seq;
ok($nucleotide_seq =~ /^ATGACCATCCTTTTCCTT.*ACCATTAAAAGGGGAAGATAG$/,   "nucleotide sequence");
is(length($nucleotide_seq), 744,                                       "nucleotide length");

# Jump down to L26462 which has a CDS join(866..957,1088..1310,2161..2289), which is broken?
while ($seq->accession ne "L26462") {
    $seq = $stream->next_seq;
}
ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
my $feat;
foreach my $feat2 ( @features ) {
    next unless ($feat2->primary_tag eq "CDS");
    my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
    if (grep { $_ eq "GI:532506" } @db_xrefs) {
       $feat = $feat2;
       last;
    }
}
my ($protein_seq) = $feat->annotation->get_Annotations("translation");
ok($protein_seq    =~ /^MVHLTPEEKSAVTALWGK.*VQAAYQKVVAGVANALAHKYH$/,   "protein sequence");
my ($nucleotide_seq) = $feat->seq->seq;
ok($nucleotide_seq =~ /^ATGGTGCATCTGACTCCT.*CTGGCCCACAAGTATCACTAA$/,   "nucleotide sequence - correct CDS range");
#print "[$nucleotide_seq]\n";
ok($nucleotide_seq !~ /^ACCTCCTATTTGACACCA.*TGCTAGTCTCCCGGAACTATC$/,   "nucleotide sequence - full nucleotide should not match");
is(length($nucleotide_seq), 444,                                       "nucleotide length");

# I have an old(?) version of M12730 which lists
#    CDS join(1959..2355,1..92)
#    /db_xref="GI:150830"
# Crazy ranges like that don't work at all, you end up with the full nucleotide sequence...
# But NCBI doesn't list M12730 that way any more, so now I would be OK?

# ------------------