[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq

Tue Dec 12 18:45:05 UTC 2006

On Dec 12, 2006, at 10:05 AM, Amir Karger wrote:
...

> http://fungal.genome.duke.edu/annotations/sbay/gff/ 
> saccharomyces_bayanus
> .20031001.AUGUSTUS.gff3.gz (Thanks for a Really Useful site, Jason!)
>
> c127 (for example) has two lines in that file:
> sbay_c127       AUGUSTUS        mRNA    263     723     .       +
> .       ID=sbay_c127-g1.1
> sbay_c127       AUGUSTUS        CDS     263     723     .       +
> 1       Parent=sbay_c127-g1.1
>
> Now go to gbrowse page:
> http://fungal.genome.duke.edu/cgi-bin/gbrowse/sbay/
> Type "sbay_c127:250-300" in the search box.
>
> As you can see from the translation track, if you start at bp 263, you
> hit a stop codon after just a few aas. But if you use frame2/phase 1,
> you get no stop codons all the way to the end of the contig.

Yes, but there are two things.  First, there is no distinct start  
codon.  Second, this is what the top NCBI BLASTX hit for that  
particular exon is:

 >gi|6323195|ref|NP_013267.1| Gene info Essential 100kDa subunit of  
the exocyst complex (Sec3p, Sec5p,
Sec6p, Sec8p, Sec10p, Sec15p, Exo70p, and Exo84p), which has
the essential function of mediating polarized targeting of
secretory vesicles to active sites of exocytosis; Sec10p [Saccharomyces
cerevisiae]
  gi|2498891|sp|Q06245|SEC10_YEAST Gene info Exocyst complex  
component SEC10
  gi|1234854|gb|AAB67490.1| Gene info L9362.12 gene product
  gi|1781307|emb|CAA70041.1| Gene info 100 kD exocyst complex  
component [Saccharomyces cerevisiae]
Length=871

  Score =  285 bits (728),  Expect = 7e-77
  Identities = 141/152 (92%), Positives = 149/152 (98%), Gaps = 0/152  
(0%)
  Frame = +2

Query  2     
FNDFYSMGKSDIVEQLRLSKNWKFNLKSVILMKNLLILSSKLETNSIPKTINTKLIIEKY  181
             +NDFYSMGKSDIVEQLRLSKNWK NLKSV LMKNLLILSSKLET+SIPKTINTKL 
+IEKY
Sbjct  168   
YNDFYSMGKSDIVEQLRLSKNWKLNLKSVKLMKNLLILSSKLETSSIPKTINTKLVIEKY  227

Query  182   
SEMMENKLLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE  361
             SEMMEN 
+LLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE
Sbjct  228   
SEMMENELLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE  287

Query  362  NEFENVFIKNVKFKERLVDFESHSVIVEASMQ  457
             NEFENVFIKNVKFKE+L+DFE+HSVI+E SMQ
Sbjct  288  NEFENVFIKNVKFKEQLIDFENHSVIIETSMQ  319

Note the query start is well into the predicted coding sequence.   
Both the lack of a start codon and the above BLASTX hit suggest this  
is not actually the first exon in the coding region.  Therefore the  
sequence retrieved from spliced_seq() is only part of the full coding  
region (it seems to lack at least one 3' exon as well).

>>>> You can already pass the frame or an offset to
>>>> PrimarySeqI::translate().
>>>>  We could add a '-phase' argument for
>>>> convenience which accepts 0,1,2.
>>>
>>>  What if I
>>> want to get the RNA sequence that will become the protein? then
>>> having a
>>> phase arg to translate() doesn't help. Should there be a
>> phase arg to
>>> spliced_seq?
>>
>> You'll also note Jason mentioned there were possible errors in the
>> gene prediction programs which produced the output
>
> That's certainly possible. No gene prediction program will be perfect.
> In this case, though, it's clear that it found a large region without
> stop codons in it, and correctly identified the place to start
> translating. I guess I'm just surprised that, if it found just one  
> exon
> in a gene (in the whole contig) why it would say the exon starts at  
> 263
> with a phase 1, instead of just saying it starts at 264.

Maybe the gene prediction didn't find the first exon, or didn't tie  
the predicted exons together.  Not unusual considering the number of  
predictions made.

>> spliced_seq() is supposed to return the DNA sequence of a split
>> location by splicing together the sublocation sequences in their
>> 'join' order.  So, if the first exon was out of phase, once spliced
>> they should all be out of phase to the same degree, assuming all
>> exons are joined together correctly.   Translating this using the
>> phase should produce the correct amino acid sequence.
>>
>> Note that Jason suggested passing the frame/phase of the first exon
>> to translate(), not spliced_seq().  I also suggested translate().
>
> You're right. This brings the number of translated polypeptide  
> sequences
> that have lots of *s in them to 9 instead of 90.
>
> I guess I have two requests here. The first is, if a person wants  
> to see
> exactly which bps are translated to aas -- a nucelotide sequece of
> exactly 3N bp starting (usually) with ATG -- then they might want an
> argument to spliced_seq that skips the first one or two bp when
> necessary. After all, they might want to study the DNA, not the
> peptides.
>
> The second request is for "intelligent objects". If my SeqFeatures  
> know
> that they're in phase 1, then when I call spliced_seq I want the
> resulting objects to know that they're phase one, such that when I  
> call
> translate, Bioperl automatically skips the first bp or two.  
> Admittedly,
> there might be big ramifications to this.
>
> Both requests of course made in the knowledge that Bioperl is open
> source & developers have a lot to do with their time.
>
> -Amir Karger

You may want to post these as enhancement requests to Bugzilla just  
so we can keep track.  I think passing a phase parameter to  
spliced_seq() can be easily accomplished; it's just a matter of  
returning a subseq of the spliced sequence based on the phase if  
set.  In fact, I am testing it out now.

The second may be more problematic, since there may be a time when  
one would want those extra nucleotides, so I don't think we would  
want removal of said nucleotides to be the default behavior.

Chris