[Bioperl-l] Using frame info from GFF in getting a Seq->spliced_seq

Amir Karger akarger at CGR.Harvard.edu
Fri Dec 8 18:25:47 UTC 2006


> This was a problem in the gene prediction output I suspect, more  
> recent versions of the program should have fixed this.  I do not  
> currently have free time to deal with the errors in the small number  
> of ORFs where this has happened.
> 
> I think you just need to do
>   start -= start- (frame*strand)
> for 1st exons.

I used
    if (strand==1) {start += exon->frame}
    else {end -= exon->frame}

This took me from 90 translations that had * within the sequence to just
9, out of 5500 CDS in S bayanus.

> You can also probably provide the 1st exon's frame to the translate  
> function as another possibility but you should try and get the CDS  
> correct first depending on your downstream analyses.

Yes, I think. Scott Cain pointed out that GFF column 8 is the "phase",
which I had never heard of before. My current, very limited,
understanding is that sometimes you'll have an exon with, say, 31 bp,
followed by an exon with 29 bp. When the intron gets spliced out, you
eventually get an mRNA of 60 bp, which translates to a protein of 20 aa.
But the second exon has a phase of 1, not 0, because you can't just
start translating at the first bp of the second exon and expect to get
nice amino acids.

By the way, whether or not phase is the same thing as frame, when I call
the frame() method on the features created by Bio::Tools::GFF, I get the
phase info. I assume that's a feature (no pun intended), not a bug?

I'm still confused as to why you would have a phase in the first exon,
though. Why not just say the CDS starts 1 or 2 bp later? (This is
probably a bio question, not a bioperl question, but a quick Google
didn't get me an answer. "Phase" isn't a very good search term.)

I guess the real question here, which Jason alludes to, is whether
SeqFeature->spliced_seq ought to take into account the phase information
of the first exon. Right now, it doesn't, so when you call
SeqFeature->spliced_seq->translate, you get gibberish. Are there cases
where you would want spliced_seq to include the first bp or two? Should
there be an option to spliced_seq for whether you want to take phase
information into account?

I can't submit a bug report until we confirm it's a bug.

Thanks,
-Amir Karger

> -jason
> On Dec 7, 2006, at 1:32 PM, Amir Karger wrote:
> 
> > I need to know how to get the frame information in exon features
> > (created by Bio::Tools::GFF) into a whole-gene feature that will be
> > translated into a protein.
> >
> > I'm reading in some fungal GFFs generated by Jason Stajich. I
> >
> > - Use Bio::Tools::GFF to create a feature for each exon in a gene
> > - Create a Bio::Location::Split object containing each feature's
> > location
> > - Create a Bio::SeqFeature::Generic object whose location 
> is the above
> > BL::Split
> > - Attach my contig Bio::Seq to the feature
> > - get the protein with feature->spliced_seq->translate->seq
> >
> > (Code below)
> >
> > Unfortunately, I get the wrong result when the GFF features 
> have frame
> > != 0. This happens for only a few percent of the exons, but when it
> > does, I end up translating in the wrong frame.
> >
> > If I read the docs correctly, Location objects don't have a 
> frame. So
> > how do I get the correct spliced_seq, which skips one or 
> two bp at the
> > beginning of certain exons?
> >
> > I suspect the answer to this is that I'm going about this in  
> > completely
> > the wrong way, in which case, please tell me how I ought to be  
> > doing it.
> >
> > Thanks,
> > - Amir Karger
> > Research Computing
> > Life Sciences Division
> > Harvard University
> >
> > P.S. In case you want to see actual code, here it is. After using
> > Bio::Tools::GFF to create a sorted list of features for each exon
> > (basically stolen from the module POD), I:
> >     # Create a new object representing the exons' gene
> >     my $coding_loc_obj = new Bio::Location::Split;
> >     foreach my $exon (@sorted_exons) {
> >         $coding_loc_obj->add_sub_Location($exon->location);
> >     }
> >
> >     # Build a spliced feature representing the whole gene
> >     my $spliced_feat = new Bio::SeqFeature::Generic(
> >         -start  => $coding_loc_obj->start,
> >         -end    => $coding_loc_obj->end,
> >         -strand => $strand_num,
> >         -primary=> "splicedGene",
> >     );
> >     $spliced_feat->location($coding_loc_obj);
> >
> >     # Attach a contig object containing the sequence
> >     $spliced_feat->attach_seq($contig_obj->bioperl_object);
> >
> >     # Get the spliced seq and translate to protein:
> >     my $coding_seq = $spliced_feat->spliced_seq->seq;
> >     my $protein = $spliced_feat->spliced_seq->translate->seq;
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> 




More information about the Bioperl-l mailing list