[Bioperl-l] Polyproteins, ribo slippage, and mat_peptide in viruses?
Chris Larsen
clarsen at vecna.com
Tue Oct 27 22:13:22 UTC 2009
Peter, Chris,
Thank you both for your expert and well-presented dialog. Yes, here
is an actual and typical problem in generating protein sequences from viral
polyproteins when the mat_peptide features carry no sequence and no unique ID:
For record:
http://www.ncbi.nlm.nih.gov/nuccore/112253723?report=genbank
this is the coronavirus:
LOCUS DQ848678 29277 bp RNA linear VRL 12-SEP-2006
DEFINITION Feline coronavirus strain FCoV C1Je, complete genome.
ACCESSION DQ848678
VERSION DQ848678.1 GI:112253723
It contains a polyprotein 1ab where the ribosome stalls at 12358,
backs up one nucleotide, changes frame, and then continues:
CDS join(311..12358,12358..20391)
/ribosomal_slippage
/codon_start=1
/product="polyprotein 1ab"
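To make the slippage concrete: the join() location re-reads base 12358, which is how the -1 frameshift is encoded. Here is a minimal sketch in plain Python (not BioPerl) of parsing such a location and splicing the coding sequence. The names parse_join and splice are made up for illustration, and the sketch assumes a simple forward-strand join with no complement() or fuzzy ends.

```python
def parse_join(location):
    """Parse 'join(a..b,c..d)' or 'a..b' into 1-based inclusive (start, end) pairs.

    Assumption: forward strand only, no complement(), no '<' or '>' markers.
    """
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        start, end = part.split("..")
        segments.append((int(start), int(end)))
    return segments


def splice(genome, segments):
    """Concatenate the segments; GenBank coordinates are 1-based and inclusive."""
    return "".join(genome[start - 1:end] for start, end in segments)


segments = parse_join("join(311..12358,12358..20391)")
lengths = [end - start + 1 for start, end in segments]
# Base 12358 ends the first segment and starts the second, so it is read twice.
# Both segment lengths are multiples of 3, so the spliced CDS stays in frame.
print(segments, lengths, sum(lengths) % 3)
```

Because the shared base is counted in both segments, the spliced CDS is one nucleotide longer than the genomic span 311..20391.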
The genome has multiple, gigantic polyproteins. The child peptides
are what we would actually love to focus our comparative
bioinformatic scrutiny on, but none of them have mappable IDs or
sequences below the polyprotein level, as can be seen:
mat_peptide 15118..16914 <===
/product="nsp13" <===
/note="helicase" <=== these are all we have to go on
Below the polyprotein level we are given no ID and no protein
sequence, just positions, and nucleotide positions at that. We are
given the complete polyprotein sequence, though, so we have the
components to do this on paper, but no code. In summary, we would like
to feed GenBank files to a method and get out FASTA protein files that
have some IDs and sequences.
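The "on paper" arithmetic can be sketched as follows: convert the mat_peptide's nucleotide positions into offsets within the spliced CDS, then divide by 3 to get a slice of the polyprotein /translation. This is only an illustration in plain Python. The function names are hypothetical, and the handling of the doubly read base (a start maps to its first occurrence, an end to its last) is my assumption, not something the record specifies.

```python
# CDS segments from the record above: join(311..12358,12358..20391)
SEGMENTS = [(311, 12358), (12358, 20391)]


def genome_to_cds_offset(pos, segments, prefer_last=False):
    """Return the 0-based offset of 1-based genome position `pos` in the spliced CDS.

    A base that sits in two segments (the slippage site) has two offsets;
    `prefer_last` picks the later one. This tie-break is an assumption.
    """
    offset = 0
    hits = []
    for start, end in segments:
        if start <= pos <= end:
            hits.append(offset + (pos - start))
        offset += end - start + 1
    if not hits:
        raise ValueError(f"position {pos} is outside the CDS")
    return hits[-1] if prefer_last else hits[0]


def peptide_aa_range(pep_start, pep_end, segments):
    """Return a 0-based half-open (begin, end) slice into the CDS translation."""
    first = genome_to_cds_offset(pep_start, segments)
    last = genome_to_cds_offset(pep_end, segments, prefer_last=True)
    length = last - first + 1
    if first % 3 != 0 or length % 3 != 0:
        raise ValueError("mat_peptide is not in frame with the CDS")
    return first // 3, (last + 1) // 3


begin, end = peptide_aa_range(15118, 16914, SEGMENTS)
# nsp13 would then be translation[begin:end] of the polyprotein 1ab /translation
print(begin, end, end - begin)
```

For the helicase at 15118..16914 this gives the slice 4936:5535, that is residues 4937 to 5535 of polyprotein 1ab, 599 amino acids.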
You guys are forcing me (thank god) to think critically and clearly
about it too, so let me extend the proposed module or method as best I
can:
Input: offending virus GenBank file
For each mat_peptide in a CDS:
    get nuc coords, e.g. 15118..16914
    translate it to aa
    if slippage:
        stop translating at end of frame 1 and hold for later
        now translate frame 2
        $full_translation = $translation_part_1 . $translation_part_2
    else:
        $full_translation = the single-frame translation
    compare $full_translation to the CDS translation, e.g. /translation="MSSKHFKILVNE..."
    if it is an identical subsequence, admit it as the real, valid mat_peptide sequence
    annotate with parent unique ID + product name, e.g. 112253723_helicase
    go to next mat_peptide
Concatenate the products into a FASTA amino acid file for the whole virus isolate or CDS set for your virus
PHEW.
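The steps above could be sketched roughly like this. It is plain Python rather than a BioPerl module, it uses the standard genetic code, and it runs on a made-up 21-base toy genome instead of the real record. The function names, the toy sequence, and the "toy" parent ID are all invented for illustration. Parsing the GenBank file itself is left out, so each mat_peptide is passed in as a product name plus its list of segments.

```python
# Standard genetic code (NCBI table 1), codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nuc):
    """Translate in frame 1; unknown codons become 'X', a trailing partial codon is dropped."""
    nuc = nuc.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(nuc[i:i + 3], "X") for i in range(0, len(nuc) - 2, 3))


def splice(genome, segments):
    """Join 1-based inclusive segments; a slippage site appears in two segments."""
    return "".join(genome[start - 1:end] for start, end in segments)


def mat_peptides_to_fasta(genome, cds_translation, parent_id, peptides):
    """Keep only peptides whose translation is a subsequence of the CDS translation."""
    records = []
    for product, segments in peptides:
        seq = translate(splice(genome, segments)).rstrip("*")
        if seq and seq in cds_translation:
            records.append(f">{parent_id}_{product}\n{seq}\n")
    return "".join(records)


# Toy genome with a -1 slip at base 11: CDS is join(3..11,11..19).
toy_genome = "CCATGAAATTTGGGCCTAAGG"
toy_cds = translate(splice(toy_genome, [(3, 11), (11, 19)])).rstrip("*")
toy_peptides = [
    ("pepA", [(6, 11), (11, 16)]),  # spans the slippage site
    ("bad", [(4, 9)]),              # out of frame, should be rejected
]
fasta = mat_peptides_to_fasta(toy_genome, toy_cds, "toy", toy_peptides)
print(fasta)
```

On the real record the parent ID would be the GI (112253723) and the product would come from /product or /note, giving headers like 112253723_helicase. A peptide that spans the slippage site needs its own join() location, as pepA has here.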
Chris L
==========
>
> I think one could use the full-length protein and run TFASTX (which
> allows frameshifts) against the nucleotide sequence. The output
> will have the frameshifts designated with '/' or '\', so it would
> then be a matter of splitting the sequence based on the midline,
> then mapping those protein fragments back to the original sequence
> coordinates. Is this along the lines of what you mean?
>
> chris
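As a rough illustration of the splitting step suggested above: once the aligned protein string has been pulled out of the TFASTX report, it can be cut at the frameshift markers. This sketch assumes only what the message says, that shifts appear as '/' or '\' in that string. The example string is invented, extracting it from the actual report is not shown, and split_on_frameshifts is a hypothetical helper.

```python
import re


def split_on_frameshifts(aligned):
    """Split an aligned protein string at '/' and '\\' frameshift markers.

    Returns (fragment, marker) pairs, where marker is the symbol that follows
    the fragment, or None for the last fragment. Gap characters '-' are
    removed from each fragment (an assumption about the alignment format).
    """
    parts = re.split(r"([/\\])", aligned)
    fragments = []
    for i in range(0, len(parts), 2):
        marker = parts[i + 1] if i + 1 < len(parts) else None
        fragments.append((parts[i].replace("-", ""), marker))
    return fragments


# Invented example string, not real TFASTX output.
print(split_on_frameshifts("MKF\\WA-G/GG"))
```

Each fragment's length times 3 gives the number of nucleotides it covers, which is the starting point for mapping the fragments back to sequence coordinates.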
Let me look into this. Thank you, CF; I have not used TFASTX before.