[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq
Chris Fields
cjfields at uiuc.edu
Tue Dec 12 18:45:05 UTC 2006
On Dec 12, 2006, at 10:05 AM, Amir Karger wrote:
...
> http://fungal.genome.duke.edu/annotations/sbay/gff/
> saccharomyces_bayanus
> .20031001.AUGUSTUS.gff3.gz (Thanks for a Really Useful site, Jason!)
>
> c127 (for example) has two lines in that file:
> sbay_c127 AUGUSTUS mRNA 263 723 . +
> . ID=sbay_c127-g1.1
> sbay_c127 AUGUSTUS CDS 263 723 . +
> 1 Parent=sbay_c127-g1.1
>
> Now go to gbrowse page:
> http://fungal.genome.duke.edu/cgi-bin/gbrowse/sbay/
> Type "sbay_c127:250-300" in the search box.
>
> As you can see from the translation track, if you start at bp 263, you
> hit a stop codon after just a few aas. But if you use frame2/phase 1,
> you get no stop codons all the way to the end of the contig.
Yes, but there are two things. First, there is no distinct start
codon. Second, this is what the top NCBI BLASTX hit for that
particular exon is:
>gi|6323195|ref|NP_013267.1| Gene info Essential 100kDa subunit of
the exocyst complex (Sec3p, Sec5p,
Sec6p, Sec8p, Sec10p, Sec15p, Exo70p, and Exo84p), which has
the essential function of mediating polarized targeting of
secretory vesicles to active sites of exocytosis; Sec10p [Saccharomyces
cerevisiae]
gi|2498891|sp|Q06245|SEC10_YEAST Gene info Exocyst complex
component SEC10
gi|1234854|gb|AAB67490.1| Gene info L9362.12 gene product
gi|1781307|emb|CAA70041.1| Gene info 100 kD exocyst complex
component [Saccharomyces cerevisiae]
Length=871
Score = 285 bits (728), Expect = 7e-77
Identities = 141/152 (92%), Positives = 149/152 (98%), Gaps = 0/152
(0%)
Frame = +2
Query 2
FNDFYSMGKSDIVEQLRLSKNWKFNLKSVILMKNLLILSSKLETNSIPKTINTKLIIEKY 181
+NDFYSMGKSDIVEQLRLSKNWK NLKSV LMKNLLILSSKLET+SIPKTINTKL
+IEKY
Sbjct 168
YNDFYSMGKSDIVEQLRLSKNWKLNLKSVKLMKNLLILSSKLETSSIPKTINTKLVIEKY 227
Query 182
SEMMENKLLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE 361
SEMMEN
+LLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE
Sbjct 228
SEMMENELLENFNSAYRENNFTKLNEIAIILNNFNGGVNVIQSFINQHDYFIDTKQIDLE 287
Query 362 NEFENVFIKNVKFKERLVDFESHSVIVEASMQ 457
NEFENVFIKNVKFKE+L+DFE+HSVI+E SMQ
Sbjct 288 NEFENVFIKNVKFKEQLIDFENHSVIIETSMQ 319
Note the query start is well into the predicted coding sequence.
Both the lack of a start codon and the above BLASTX hit suggest this
is not actually the first exon in the coding region. Therefore the
sequence retrieved from spliced_seq() is only part of the full coding
region (it seems to lack at least one 3' exon as well).
>>>> You can already pass the frame or an offset to
>>>> PrimarySeqI::translate().
>>>> We could add a '-phase' argument for
>>>> convenience which accepts 0,1,2.
>>>
>>> What if I
>>> want to get the RNA sequence that will become the protein? then
>>> having a
>>> phase arg to translate() doesn't help. Should there be a
>> phase arg to
>>> spliced_seq?
>>
>> You'll also note Jason mentioned there were possible errors in the
>> gene prediction programs which produced the output
>
> That's certainly possible. No gene prediction program will be perfect.
> In this case, though, it's clear that it found a large region without
> stop codons in it, and correctly identified the place to start
> translating. I guess I'm just surprised that, if it found just one
> exon
> in a gene (in the whole contig) why it would say the exon starts at
> 263
> with a phase 1, instead of just saying it starts at 264.
Maybe the gene prediction didn't find the first exon, or didn't tie
the predicted exons together. Not unusual considering the number of
predictions made.
>> spliced_seq() is supposed to return the DNA sequence of a split
>> location by splicing together the sublocation sequences in their
>> 'join' order. So, if the first exon was out of phase, once spliced
>> they should all be out of phase to the same degree, assuming all
>> exons are joined together correctly. Translating this using the
>> phase should produce the correct amino acid sequence.
>>
>> Note that Jason suggested passing the frame/phase of the first exon
>> to translate(), not spliced_seq(). I also suggested translate().
>
> You're right. This brings the number of translated polypeptide
> sequences
> that have lots of *s in them to 9 instead of 90.
>
> I guess I have two requests here. The first is, if a person wants
> to see
> exactly which bps are translated to aas -- a nucelotide sequece of
> exactly 3N bp starting (usually) with ATG -- then they might want an
> argument to spliced_seq that skips the first one or two bp when
> necessary. After all, they might want to study the DNA, not the
> peptides.
>
> The second request is for "intelligent objects". If my SeqFeatures
> know
> that they're in phase 1, then when I call spliced_seq I want the
> resulting objects to know that they're phase one, such that when I
> call
> translate, Bioperl automatically skips the first bp or two.
> Admittedly,
> there might be big ramifications to this.
>
> Both requests of course made in the knowledge that Bioperl is open
> source & developers have a lot to do with their time.
>
> -Amir Karger
You may want to post these as enhancement requests to Bugzilla just
so we can keep track. I think passing a phase parameter to
spliced_seq() can be easily accomplished; it's just a matter of
returning a subseq of the spliced sequence based on the phase if
set. In fact, I am testing it out now.
The second may be more problematic, since there may be a time when
one would want those extra nucleotides, so I don't think we would
want removal of said nucleotides to be the default behavior.
Chris
More information about the Bioperl-l
mailing list