[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?

Tue Oct 27 20:46:05 UTC 2009

On Oct 27, 2009, at 3:17 PM, Peter wrote:

> On Tue, Oct 27, 2009 at 8:07 PM, Chris Larsen <clarsen at vecna.com> wrote:
>>
>> Peter,
>>
>> This is a good strategy when the gi is given. However, I failed to mention
>> that the example I gave is turning out to be unusual (15%?): most virus
>> 'mature peptides' we will apply this analysis to do not in fact have a gi
>> number or unique identifier associated with them. There are thousands of
>> dengue virus files to be processed to give mature proteins.
>>
>> Should have mentioned this... Hence the problem: we can't look it up because
>> only the parent polyprotein has a gi. There's nothing to look up /by/ in most
>> cases. So we still have to build the set of proteins that are cleaved out of
>> every polyprotein, by local and high-throughput methods, by building it out
>> of the available information (sadly, kind of a runaround; it should be in
>> the GenBank entry).
>>
>> Chris
>
> Ah. That's a shame. I did just take a few minutes to try out the
> EFetch idea (using Biopython) and it does work beautifully for
> this "nice" example virus which the NCBI have annotated.

Interesting thing about that example: if you follow the hyperlinks for  
the mat_peptide feature key, they relate back to the full protein  
sequence with from/to, not to the protein_id for the feature.  Example:

# link from the first mat_peptide
http://www.ncbi.nlm.nih.gov/protein/9630804?from=1&to=398&report=gpwithparts

# protein_id
http://www.ncbi.nlm.nih.gov/protein/28416959

This record doesn't appear to contain any mapping information along
those lines, which makes me think it was autogenerated from the Gene
record, which does have those mappings:

http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Retrieve&dopt=full_report&list_uids=1491970
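
If from/to pairs on the parent protein are all you have, cutting out the
mature peptides is essentially slicing. A rough Python sketch, assuming the
coordinates are 1-based and inclusive like the from/to values in the
mat_peptide link; the function name, feature names, and toy sequence are
made up for illustration and are not taken from any real record:

```python
def cut_mature_peptides(polyprotein, features):
    """Return {name: sequence} for each (name, start, end) feature.

    Coordinates are assumed to be 1-based and inclusive, like the
    from/to values in the NCBI mat_peptide link.
    """
    peptides = {}
    for name, start, end in features:
        if not 1 <= start <= end <= len(polyprotein):
            raise ValueError(
                f"{name}: {start}..{end} outside 1..{len(polyprotein)}"
            )
        # Convert 1-based inclusive coordinates to a Python slice
        peptides[name] = polyprotein[start - 1:end]
    return peptides


# Made-up toy data, not a real viral polyprotein
toy_polyprotein = "MKTAYIAKQRQISFVKSHFSRQ"
toy_features = [("pepA", 1, 5), ("pepB", 6, 12)]
peptides = cut_mature_peptides(toy_polyprotein, toy_features)
```

The hard part for the unannotated records is getting the from/to pairs in
the first place; this only covers the step after that.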

> I also note that in the example given, all the mature peptides
> have nice and simple locations (in terms of their co-ordinates
> for the nucleotides), no ribosomal slippages etc. This means
> grabbing the relevant bits of the genome and translating it is
> also pretty easy (option 2 in your original email).
>
> Have you got a more typical entry you can point us at?
>
> If there is nothing publicly available, I wouldn't mind you
> emailing me one or two to look at off list (and if you don't mind,
> they might make good examples for Bio* project unit tests
> or examples).
>
> Peter
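
For the simple case Peter describes (plain nucleotide coordinates, no
slippage), grabbing the region and translating it could look roughly like
the sketch below. It is plain Python with a hand-built standard codon table
rather than any Bio* toolkit call; the function name and the toy genome are
invented for illustration, and coordinates are assumed to be 1-based and
inclusive on the forward strand:

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_region(genome, start, end):
    """Translate genome[start..end] (1-based, inclusive, forward strand)."""
    region = genome[start - 1:end].upper().replace("U", "T")
    if len(region) % 3:
        raise ValueError("region length is not a multiple of 3")
    # Codons containing ambiguous bases are translated as 'X'
    return "".join(
        CODON_TABLE.get(region[i:i + 3], "X")
        for i in range(0, len(region), 3)
    )


# Made-up toy genome; the "mat_peptide" here is nucleotides 3..11
toy_genome = "CCATGGCTAAATGACC"
peptide = translate_region(toy_genome, 3, 11)
```

This deliberately ignores joins, complements, and slippage, which is
exactly where the less tidy records get awkward.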

I think one could use the full-length protein and run TFASTX (which  
allows frameshifts) against the nucleotide sequence.  The output will  
have the frameshifts designated with '/' or '\', so it would then be a  
matter of splitting the sequence based on the midline, then mapping  
those protein fragments back to the original sequence coordinates.  Is  
this along the lines of what you mean?
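
Roughly, the splitting and mapping step might look like the Python sketch
below. It does not parse real TFASTX output; it takes an aligned translated
string that has already been pulled out of the report. The function name and
toy input are made up, and the offsets are an assumption: I am treating '/'
as "reading resumes one nucleotide further along" and '\' as "reading
resumes one nucleotide back", which should be checked against the FASTA
documentation for the version in use.

```python
# ASSUMED convention, check against your FASTA version's documentation:
#   '/'  forward frameshift: the next codon starts 1 nt further along
#   '\'  reverse frameshift: the next codon starts 1 nt back
SHIFT = {"/": +1, "\\": -1}


def split_on_frameshifts(aligned, nt_start=1):
    """Split an aligned translated sequence at frameshift markers.

    Returns a list of (peptide, first_nt, last_nt) tuples with 1-based,
    inclusive nucleotide coordinates. Gap characters ('-') are dropped
    and consume no nucleotides.
    """
    fragments = []
    pos = nt_start        # nucleotide where the next codon begins
    frag_start = pos
    current = []
    for ch in aligned:
        if ch in SHIFT:
            if current:
                fragments.append(("".join(current), frag_start, pos - 1))
            pos += SHIFT[ch]
            frag_start = pos
            current = []
        elif ch != "-":
            current.append(ch)
            pos += 3      # each residue consumes one codon
    if current:
        fragments.append(("".join(current), frag_start, pos - 1))
    return fragments


# Made-up aligned string with one reverse and one forward frameshift
fragments = split_on_frameshifts("MAK\\LLG/PW")
```

Each tuple gives a protein fragment plus the nucleotide span it came from,
which is the piece needed to write the fragments back as feature locations.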

chris