[Bioperl-l] locate introns in a protein sequence

Jason Stajich jason at bioperl.org
Tue Mar 8 07:42:54 UTC 2011


I think there are several ways to do this, the easiest is to walk 
through each exon, and count up the codons, and print out the 
translation of the codons so far when you get to the end of an exon, 
keeping the overhang for the next exon.

That's basically what I do here in terms of counting, though I'm mapping 
introns into an alignment not a single peptide (though undocumented so 
it is a bit impenetrable)
  https://github.com/hyphaltip/thesis/blob/master/src/intron/map_introns_aln.pl

Use Bio::DB::SeqFeature::Store will make this much easier.  You will 
want to convert this Broad GTF to GFF3 with this script
https://github.com/hyphaltip/genome-scripts/blob/master/data_format/gtf2gff3_3level.pl

Then load the annotation along with a scaffold GFF3 (like one from this 
script run on your FASTA genome file
  https://github.com/hyphaltip/genome-scripts/blob/master/gbrowse_tools/fasta_to_gbrowse_scaffold.pl
  - just watch out -- the Broad is capitalizing the contig name 
(Supercontig_3.1) while the FASTA file will have it uncapped and 
truncated name (supercont_3.1) so you need to fix that in either of the 
files so it is consistent - a perl one liner will work here;
perl -i -p -e 's/^>supercont3\./>Supercontig_3./' GENOME.fasta

and load this in and you'll then be able to use the Bio::DB::SeqFeature 
to get all genes, walk through them a CDS for each gene, one a time, and 
then just keep count of the frame you are in and the exon length and 
you'll know where the introns go in the protein.

for my $gene ( $dbh->features(-type => 
"gene:S_cryophilus_OY26_V3_CALLGENES_FINAL_2
") ) {
  for my $mRNA ( $gene->get_SeqFeatures('mRNA') ) {
   for my $CDS ( $mRNA->get_SeqFeatures('CDS') ) {
    my $length = $CDS->length;
    # do some stuff to keep count or whatever you need to figure out 
where you are to put the intron
   }
  }
}

you can also see this code which converts folders of GFF3 + genomes into 
coding seqs, translated peps, etc.
  https://github.com/hyphaltip/genome-scripts/blob/master/gene_prediction/gff3_to_cdspep.pl

It all depends on what you want to report from this type of query.

Tao Zhu wrote:
> Hello, everyone. For example, I have a GTF file annotating like this,
>
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 start_codon 25009
> 25011 . + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 stop_codon 26003
> 26005 . + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 24828
> 25172 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25009 25172 .
> + 0 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 25245
> 25364 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25245 25364 .
> + 1 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 exon 25414
> 26178 . + . gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
> Supercontig_3.3 S_cryophilus_OY26_V3_CALLGENES_FINAL_2 CDS 25414 26002 .
> + 1 gene_id "SPOG_00008"; transcript_id "SPOG_00008T0";
>
> Obviously this transcript "SPOG_00008T0" has two introns.
>
> Also I have a corresponding protein sequence file like this( in fasta
> format),
>
>> SPOG_00008T0 | SPOG_00008 | Schizosaccharomyces cryophilus OY26 exosome
> subunit Rrp45 (292 aa)
> MSKSLEPSANNKGFIVNALKKELRLDGRSLTDFRDLKIEFGEDYGQVDISLGSTRVMARI
> SAEITKPYSDRPFDGIFAITTELTPLASPAFETGRVSEQEVIISRLIEQAIRRSNALDTE
> SLCIISGQKCWSVRASVHFINHDGNLVDAACIAVITGLCHFRRPEITVLGDEVTVHSIEE
> RVPVPLSVLHTPICVTFSFFEDGSLSAIDASLEEEELRTGSMTVTLNKNREICQIFKAGG
> VTIEASSVVACAHTAFQKTTSIISEIQRALDEDLSKKETQFFGGSAENQRS*
>
> I hope to precisely locate these two introns into the protein
> sequence(find their location among the amino acids). Please recommend a
> relatively convenient method. Thank you!
>
>
>

-- 
Jason Stajich
jason at bioperl.org
http://bioperl.org/wiki




More information about the Bioperl-l mailing list