[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq

Chris Fields cjfields at uiuc.edu
Tue Dec 12 03:20:06 UTC 2006


On Dec 11, 2006, at 10:20 AM, Amir Karger wrote:

>> I think the use of 'frame' here is meant relative to the DNA sequence
>> (i.e. ORF searching, 6 frames) and the 'phase' is relative to the mRNA
>> (i.e. translation, three frames).  At least I think that's what is meant!
>
> I agree. By the way, I'd love a reference to a simple bio-explanation of
> what's happening here. Google searches for "coding sequence phase" are
> not all that relevant.

Ah, Brian found some links I see...

>> It could be b/c the location coordinates delineate the exon coding
>> boundary. It's conceivable the first exon in a sequence record is not
>> the first exon of the mRNA (i.e. there may be one or more exons prior
>> to or past the exon of interest that are in 'remote' sequence records).
>
> That's certainly not the case here, because the files have the entire
> genomes in them.
>
>> Also, the ends of the location may be uncertain ('fuzzy'):
>>
>> join(complement(1009..>1260),complement(AF081827.1:<1..177))
>
> Also not the case here. These locations aren't listed as fuzzy.
>
> Any other thoughts?

Which GFF files did you use?  More specifically, which genes in which  
GFF file?  I saw a reference to S. bayanus, but it's hard to work out  
what could be the problem unless we know a bit more.

>>> I guess the real question here, which Jason alludes to, is whether
>>> SeqFeature->spliced_seq ought to take into account the phase
>>> information of the first exon. Right now, it doesn't, so when you call
>>> SeqFeature->spliced_seq->translate, you get gibberish. Are there cases
>>> where you would want spliced_seq to include the first bp or
>>> two? Should there be an option to spliced_seq for whether you
>>> want to take phase information into account?
>>
>> You can already pass the frame or an offset to PrimarySeqI::translate().
>> We could add a '-phase' argument for convenience which accepts 0,1,2.
>
> But as Jason pointed out, you should find the problem earlier. What if I
> want to get the RNA sequence that will become the protein? Then having a
> phase arg to translate() doesn't help. Should there be a phase arg to
> spliced_seq?

You'll also note Jason mentioned there were possible errors in the
gene prediction programs which produced the output.

spliced_seq() is supposed to return the DNA sequence of a split
location by splicing together the sublocation sequences in their
'join' order.  So, if the first exon starts out of phase, every codon
in the spliced sequence is shifted by that same offset, assuming all
exons are joined together correctly.  Translating the spliced sequence
with that phase as the offset should produce the correct amino acid
sequence.

Note that Jason suggested passing the frame/phase of the first exon  
to translate(), not spliced_seq().  I also suggested translate().
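
A rough sketch of that splice-then-offset idea, written in plain Python
rather than BioPerl so it stands alone.  Everything in it is made up for
illustration: the toy genome, the exon coordinates, and the splice() and
translate() helpers are hypothetical and are not the Bio::SeqFeatureI or
Bio::PrimarySeqI API.  It assumes forward-strand exons only and the
standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, ...).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice(genome, exons):
    """Join 1-based, inclusive (start, end) exon ranges in the order given."""
    return "".join(genome[start - 1:end] for start, end in exons)


def translate(dna, phase=0):
    """Skip `phase` leading bases, then translate the complete codons."""
    coding = dna[phase:]
    usable = len(coding) - len(coding) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[coding[i:i + 3]] for i in range(0, usable, 3))


# Toy record: exon 1 (1..7), an intron (8..15), exon 2 (16..22).
# The first exon has phase 2: its first two bases precede the first codon.
genome = "CCATGGC" + "GTAAGTAG" + "TAAATGA"
exons = [(1, 7), (16, 22)]

spliced = splice(genome, exons)          # leading 'CC' is still present
in_phase = translate(spliced, phase=2)   # offset applied at translation time
out_of_phase = translate(spliced)        # ignoring the phase

print(spliced, in_phase, out_of_phase)
```

The point of keeping the offset in translate() is that the spliced
sequence stays a faithful copy of the exon coordinates; only the reading
of codons changes.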

> Which raises another bio question: at what point are the first 1 or 2 bp
> dropped when you have a phase of 1 or 2? Do they appear in the mRNA?
>
> -Amir Karger

Any sequence present in the sublocations (exons) would be in the
spliced sequence.  That has to include the leading nucleotides that
translation skips b/c of the phase, since they are part of the
annotated coding region.

chris


