[Bioperl-l] Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
Chris Fields
cjfields at uiuc.edu
Thu Mar 1 15:57:02 UTC 2007
Jay,
Have you tried using $feature->spliced_seq() instead of seq()? Using
seq() retrieves the full sequence for the split location (from start
of first sublocation to end of last), while spliced_seq() splices the
sublocation sequences together, which is what I think you want.
chris
On Mar 1, 2007, at 9:51 AM, Jay Hannah wrote:
> In my GenBank files when I'm sitting on a CDS usually I can just call
>
> $feature->seq->seq;
>
> and out pops the exact nucleotide sequence which codes my protein.
> Very
> cool.
>
> Unfortunately, I have a crazy GenBank file which contains a CDS with a
> split range like this: CDS join(1959..2355,1..92)
>
> When I try to use $feature->seq->seq I don't end up with just the
> properly
> pieced together coding region, I end up with the *entire* nucleotide
> sequence.
>
> This seems to be happening because
>
> Bio::SeqFeature::Generic::seq
> 506: my $seq = $self->{'_gsf_seq'}->trunc($self->start(), $self-
> >end());
> (which is calling Bio::PrimarySeqI::trunc)
>
> works fine when Bio::SeqFeature::Generic is using
>
> '_location' => Bio::Location::Simple=HASH(0x1804344)
> '_end' => 2842
> '_location_type' => 'EXACT'
> '_root_verbose' => 0
> '_seqid' => 'AE015930'
> '_start' => 1601
> '_strand' => 1
>
> but when things get complicated and Bio::SeqFeature::Generic is using
>
> '_location' => Bio::Location::Split=HASH(0x1d1f130)
> '_seqid' => 'PNECG'
> '_splittype' => 'JOIN'
> '_sublocations' => ARRAY(0x1d1e654)
> 0 Bio::Location::Simple=HASH(0x1d1f290)
> '_end' => 2355
> '_location_type' => 'EXACT'
> '_root_verbose' => 0
> '_seqid' => 'PNECG'
> '_start' => 1959
> '_strand' => 1
> 1 Bio::Location::Simple=HASH(0x1d1f338)
> '_end' => 92
> '_location_type' => 'EXACT'
> '_root_verbose' => 0
> '_seqid' => 'PNECG'
> '_start' => 1
> '_strand' => 1
>
> Simply passing $self->start and $self->end into trunc() will not pull
> off the appropriate magic.
>
> Question 1: Perhaps my data was bad and I should refuse to process
> join(1959..2355,1..92)? My accession is M12730, and if I download that
> from NCBI now it looks like they've changed it so my problem no longer
> exists in that sequence anyway. There are already 71 examples of
> CDS join
> in various files in t/data, and *none* of those examples jump
> backwards.
> Should I write this off as bad data or try to enhance BioPerl? I'm
> happy
> to throw my painful M12730 on the end of t/data/test.genbank and write
> tests for it if anyone thinks it is important.
>
> Question 2: Even if we can just ignore my M12730, though, I think
> there's
> still a problem afoot. Below I demo L26462 (already siting in
> t/data/test.genbank) which has a
>
> CDS join(866..957,1088..1310,2161..2289)
>
> In this case (as my tests below demonstrate), $feature->seq->seq is
> pulling the right range of nucleotide, but it's also pulling the gaps
> (introns). Isn't that wrong? Shouldn't it skip the introns?
>
> So... is the appropriate approach to try to enhance
> Bio::PrimarySeqI::trunc() for Bio::Location::Split? Or should trunc
> () be
> left alone and Bio::SeqFeature::Generic::seq() needs to get smarter?
>
> Or...?
>
> Thanks, oh mighty BioWizards! :)
>
> j
> seqlab.net
> http://www.bioperl.org/wiki/User:Jhannah
>
>
>
> -----------------
> Tack this on the end of t/genbank.t and the length test at the end
> fails:
> -----------------
> # Enhance Bio::PrimarySeqI::trunc() for Bio::Location::Split ?
> my $stream = Bio::SeqIO->new(-file => Bio::Root::IO->catfile
> ("t","data","test.genbank"),
> -verbose => $verbose,
> -format => 'genbank');
> my $seq = $stream->next_seq;
> while ($seq->accession ne "M37762") {
> $seq = $stream->next_seq;
> }
> # M37762 has a CDS 76..819, which should work fine.
> ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
> my $feat;
> foreach my $feat2 ( @features ) {
> next unless ($feat2->primary_tag eq "CDS");
> my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
> if (grep { $_ eq "GI:179403" } @db_xrefs) {
> $feat = $feat2;
> last;
> }
> }
> my ($protein_seq) = $feat->annotation->get_Annotations("translation");
> ok($protein_seq =~ /^MTILFLTMVISYFGCMKA.*GWRFIRIDTSCVCTLTIKRGR
> $/, "protein sequence");
> my ($nucleotide_seq) = $feat->seq->seq;
> ok($nucleotide_seq =~ /^ATGACCATCCTTTTCCTT.*ACCATTAAAAGGGGAAGATAG
> $/, "nucleotide sequence");
> is(length($nucleotide_seq),
> 744, "nucleotide length");
>
> # Jump down to L26462 which has a CDS join
> (866..957,1088..1310,2161..2289), which is broken?
> while ($seq->accession ne "L26462") {
> $seq = $stream->next_seq;
> }
> ok(my @features = $seq->get_SeqFeatures(), "get_SeqFeatures()");
> my $feat;
> foreach my $feat2 ( @features ) {
> next unless ($feat2->primary_tag eq "CDS");
> my @db_xrefs = $feat2->annotation->get_Annotations("db_xref");
> if (grep { $_ eq "GI:532506" } @db_xrefs) {
> $feat = $feat2;
> last;
> }
> }
> my ($protein_seq) = $feat->annotation->get_Annotations("translation");
> ok($protein_seq =~ /^MVHLTPEEKSAVTALWGK.*VQAAYQKVVAGVANALAHKYH
> $/, "protein sequence");
> my ($nucleotide_seq) = $feat->seq->seq;
> ok($nucleotide_seq =~ /^ATGGTGCATCTGACTCCT.*CTGGCCCACAAGTATCACTAA
> $/, "nucleotide sequence - correct CDS range");
> #print "[$nucleotide_seq]\n";
> ok($nucleotide_seq !~ /^ACCTCCTATTTGACACCA.*TGCTAGTCTCCCGGAACTATC
> $/, "nucleotide sequence - full nucleotide should not match");
> is(length($nucleotide_seq),
> 444, "nucleotide length");
>
> # I have an old(?) version of M12730 which lists
> # CDS join(1959..2355,1..92)
> # /db_xref="GI:150830"
> # Crazy ranges like that don't work at all, you end up with the
> full nucleotide sequence...
> # But NCBI doesn't list M12730 that way any more, so now I would be
> OK?
>
> # ------------------
More information about the Bioperl-l
mailing list