[Bioperl-l] Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?

Thu Mar 1 15:57:02 UTC 2007

Jay,

Have you tried using $feature->spliced_seq() instead of seq()?  Using  
seq() retrieves the full sequence for the split location (from start  
of first sublocation to end of last), while spliced_seq() splices the  
sublocation sequences together, which is what I think you want.

chris

On Mar 1, 2007, at 9:51 AM, Jay Hannah wrote:

> In my GenBank files when I'm sitting on a CDS usually I can just call
>
>     $feature->seq->seq;
>
> and out pops the exact nucleotide sequence which codes my protein.  
> Very
> cool.
>
> Unfortunately, I have a crazy GenBank file which contains a CDS with a
> split range like this:  CDS join(1959..2355,1..92)
>
> When I try to use $feature->seq->seq I don't end up with just the  
> properly
> pieced together coding region, I end up with the *entire* nucleotide
> sequence.
>
> This seems to be happening because
>
>    Bio::SeqFeature::Generic::seq
>    506:  my $seq = $self->{'_gsf_seq'}->trunc($self->start(), $self- 
> >end());
>    (which is calling Bio::PrimarySeqI::trunc)
>
> works fine when Bio::SeqFeature::Generic is using
>
>     '_location' => Bio::Location::Simple=HASH(0x1804344)
>        '_end' => 2842
>        '_location_type' => 'EXACT'
>        '_root_verbose' => 0
>        '_seqid' => 'AE015930'
>        '_start' => 1601
>        '_strand' => 1
>
> but when things get complicated and Bio::SeqFeature::Generic is using
>
>     '_location' => Bio::Location::Split=HASH(0x1d1f130)
>        '_seqid' => 'PNECG'
>        '_splittype' => 'JOIN'
>        '_sublocations' => ARRAY(0x1d1e654)
>           0  Bio::Location::Simple=HASH(0x1d1f290)
>              '_end' => 2355
>              '_location_type' => 'EXACT'
>              '_root_verbose' => 0
>              '_seqid' => 'PNECG'
>              '_start' => 1959
>              '_strand' => 1
>           1  Bio::Location::Simple=HASH(0x1d1f338)
>              '_end' => 92
>              '_location_type' => 'EXACT'
>              '_root_verbose' => 0
>              '_seqid' => 'PNECG'
>              '_start' => 1
>              '_strand' => 1
>
> Simply passing $self->start and $self->end into trunc() will not pull
> off the appropriate magic.
>
> Question 1: Perhaps my data was bad and I should refuse to process
> join(1959..2355,1..92)? My accession is M12730, and if I download that
> from NCBI now it looks like they've changed it so my problem no longer
> exists in that sequence anyway. There are already 71 examples of  
> CDS join
> in various files in t/data, and *none* of those examples jump  
> backwards.
> Should I write this off as bad data or try to enhance BioPerl? I'm  
> happy
> to throw my painful M12730 on the end of t/data/test.genbank and write
> tests for it if anyone thinks it is important.
>
> Question 2: Even if we can just ignore my M12730, though, I think  
> there's
> still a problem afoot. Below I demo L26462 (already siting in
> t/data/test.genbank) which has a
>
>     CDS join(866..957,1088..1310,2161..2289)
>
> In this case (as my tests below demonstrate), $feature->seq->seq is
> pulling the right range of nucleotide, but it's also pulling the gaps
> (introns). Isn't that wrong? Shouldn't it skip the introns?
>
> So... is the appropriate approach to try to enhance
> Bio::PrimarySeqI::trunc() for Bio::Location::Split? Or should trunc 
> () be
> left alone and Bio::SeqFeature::Generic::seq() needs to get smarter?
>
> Or...?
>
> Thanks, oh mighty BioWizards!  :)
>
> j
> seqlab.net
> http://www.bioperl.org/wiki/User:Jhannah
>
>
>
> -----------------
> Tack this on the end of t/genbank.t and the length test at the end  
> fails:
> -----------------
> # Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
> my $stream = Bio::SeqIO->new(-file => Bio::Root::IO->catfile
>                                        ("t","data","test.genbank"),
>                                        -verbose => $verbose,
>                               -format => 'genbank');
> my $seq = $stream->next_seq;
> while ($seq->accession ne "M37762") {
>     $seq = $stream->next_seq;
> }
> # M37762 has a CDS 76..819, which should work fine.
> ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
> my $feat;
> foreach my $feat2 ( @features ) {
>     next unless ($feat2->primary_tag eq "CDS");
>     my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
>     if (grep { $_ eq "GI:179403" } @db_xrefs) {
>        $feat = $feat2;
>        last;
>     }
> }
> my ($protein_seq) = $feat->annotation->get_Annotations("translation");
> ok($protein_seq    =~ /^MTILFLTMVISYFGCMKA.*GWRFIRIDTSCVCTLTIKRGR 
> $/,   "protein sequence");
> my ($nucleotide_seq) = $feat->seq->seq;
> ok($nucleotide_seq =~ /^ATGACCATCCTTTTCCTT.*ACCATTAAAAGGGGAAGATAG 
> $/,   "nucleotide sequence");
> is(length($nucleotide_seq),  
> 744,                                       "nucleotide length");
>
> # Jump down to L26462 which has a CDS join 
> (866..957,1088..1310,2161..2289), which is broken?
> while ($seq->accession ne "L26462") {
>     $seq = $stream->next_seq;
> }
> ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
> my $feat;
> foreach my $feat2 ( @features ) {
>     next unless ($feat2->primary_tag eq "CDS");
>     my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
>     if (grep { $_ eq "GI:532506" } @db_xrefs) {
>        $feat = $feat2;
>        last;
>     }
> }
> my ($protein_seq) = $feat->annotation->get_Annotations("translation");
> ok($protein_seq    =~ /^MVHLTPEEKSAVTALWGK.*VQAAYQKVVAGVANALAHKYH 
> $/,   "protein sequence");
> my ($nucleotide_seq) = $feat->seq->seq;
> ok($nucleotide_seq =~ /^ATGGTGCATCTGACTCCT.*CTGGCCCACAAGTATCACTAA 
> $/,   "nucleotide sequence - correct CDS range");
> #print "[$nucleotide_seq]\n";
> ok($nucleotide_seq !~ /^ACCTCCTATTTGACACCA.*TGCTAGTCTCCCGGAACTATC 
> $/,   "nucleotide sequence - full nucleotide should not match");
> is(length($nucleotide_seq),  
> 444,                                       "nucleotide length");
>
> # I have an old(?) version of M12730 which lists
> #    CDS join(1959..2355,1..92)
> #    /db_xref="GI:150830"
> # Crazy ranges like that don't work at all, you end up with the  
> full nucleotide sequence...
> # But NCBI doesn't list M12730 that way any more, so now I would be  
> OK?
>
> # ------------------