[Bioperl-l] Using frame info from GFF in getting aSeq->spliced_seq

Tue Dec 12 16:05:43 UTC 2006

(sorry if this thread is boring people)

Chris Fields wrote: 

> > I agree. By the way, I'd love a reference to a simple bio- 
> > explanation of
> > what's happening here. Google searches for "coding sequence 
> phase" are
> > not all that relevant.
> 
> Ah, Brian found some links I see...

Thanks, Brian! Amazing how "coding sequence phase" finds nothing but
"intron phase" finds a ton. This is why you need to actually learn
biology, rather than Googling it.

> Which GFF files did you use?  More specifically, which genes 
> in which  
> GFF file?  I saw a reference to S. bayanus, but it's hard to 
> work out  
> what could be the problem unless we know a bit more.

http://fungal.genome.duke.edu/annotations/sbay/gff/saccharomyces_bayanus
.20031001.AUGUSTUS.gff3.gz (Thanks for a Really Useful site, Jason!)

c127 (for example) has two lines in that file:
sbay_c127       AUGUSTUS        mRNA    263     723     .       +
.       ID=sbay_c127-g1.1
sbay_c127       AUGUSTUS        CDS     263     723     .       +
1       Parent=sbay_c127-g1.1

Now go to gbrowse page:
http://fungal.genome.duke.edu/cgi-bin/gbrowse/sbay/
Type "sbay_c127:250-300" in the search box. 

As you can see from the translation track, if you start at bp 263, you
hit a stop codon after just a few aas. But if you use frame2/phase 1,
you get no stop codons all the way to the end of the contig.

> >> You can already pass the frame or an offset to
> >> PrimarySeqI::translate().
> >>  We could add a '-phase' argument for
> >> convenience which accepts 0,1,2.
> >
> >  What if I
> > want to get the RNA sequence that will become the protein? then  
> > having a
> > phase arg to translate() doesn't help. Should there be a 
> phase arg to
> > spliced_seq?
> 
> You'll also note Jason mentioned there were possible errors in the  
> gene prediction programs which produced the output

That's certainly possible. No gene prediction program will be perfect.
In this case, though, it's clear that it found a large region without
stop codons in it, and correctly identified the place to start
translating. I guess I'm just surprised that, if it found just one exon
in a gene (in the whole contig) why it would say the exon starts at 263
with a phase 1, instead of just saying it starts at 264.

> spliced_seq() is supposed to return the DNA sequence of a split  
> location by splicing together the sublocation sequences in their  
> 'join' order.  So, if the first exon was out of phase, once spliced  
> they should all be out of phase to the same degree, assuming all  
> exons are joined together correctly.   Translating this using the  
> phase should produce the correct amino acid sequence.
> 
> Note that Jason suggested passing the frame/phase of the first exon  
> to translate(), not spliced_seq().  I also suggested translate().

You're right. This brings the number of translated polypeptide sequences
that have lots of *s in them to 9 instead of 90. 

I guess I have two requests here. The first is, if a person wants to see
exactly which bps are translated to aas -- a nucelotide sequece of
exactly 3N bp starting (usually) with ATG -- then they might want an
argument to spliced_seq that skips the first one or two bp when
necessary. After all, they might want to study the DNA, not the
peptides.

The second request is for "intelligent objects". If my SeqFeatures know
that they're in phase 1, then when I call spliced_seq I want the
resulting objects to know that they're phase one, such that when I call
translate, Bioperl automatically skips the first bp or two. Admittedly,
there might be big ramifications to this.

Both requests of course made in the knowledge that Bioperl is open
source & developers have a lot to do with their time.

-Amir Karger

> > Which raises another bio question: at what point are the 
> first 1 or  
> > 2 bp
> > dropped when you have a phase of 1 or 2? Do they appear in the mRNA?
> >
> > -Amir Karger
> 
> Any sequence present in the sublocations (exons) would be in the  
> spliced sequence.  This would have to include those nucleotides in  
> exons skipped b/c of the phase since they are part of the 
> coding region.
> 
> chris
>